Fast CRCs
Abstract — CRCs have desirable properties for effective error detection. But their software implementation, which relies on many steps
of the polynomial division, is typically slower than other codes such as weaker checksums. A relevant question is whether there are
some particular CRCs that have fast implementation. In this paper, we introduce such fast CRCs as well as an effective technique to
implement them. For these fast CRCs, even without using table lookup, it is possible either to eliminate or to greatly reduce many steps
of the polynomial division during their computation.

Index  Terms — Fast  CRC,  low-complexity  CRC,  checksum,  error-detection  code,  Hamming  code,  period  of  polynomial,  fast  software 
implementation. 
1 Introduction

This paper considers cyclical redundancy checks (CRCs),
which are effective for detecting errors in communication
and computer systems. An $h$-bit CRC is typically
generated by a binary polynomial of the form

$$M(X) = (X + 1)M_1(X), \tag{1}$$

where $M_1(X)$ is a primitive polynomial of degree $h - 1$.
Existing CRCs include the CRC-16 generated by
$X^{16} + X^{15} + X^2 + 1 = (X + 1)(X^{15} + X + 1)$, and the
CRC-CCITT generated by $X^{16} + X^{12} + X^5 + 1 =
(X + 1)(X^{15} + X^{14} + X^{13} + X^{12} + X^4 + X^3 + X^2 + X + 1)$.
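The two factorizations above can be checked mechanically with carry-less multiplication. The following C sketch is only an illustration: the function name `poly_mul` and the bit-mask encoding (bit k holds the coefficient of X^k) are assumptions made for this example and do not come from the paper.

```c
#include <stdint.h>

/* Product of two GF(2) polynomials stored as bit masks
   (bit k = coefficient of X^k). Addition modulo 2 is XOR, so
   there are no carries. Assumption of this sketch: the degrees
   of a and b sum to at most 63. */
uint64_t poly_mul(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    while (b != 0) {
        if (b & 1)
            r ^= a;   /* add a * X^k for each set bit k of b */
        a <<= 1;
        b >>= 1;
    }
    return r;
}
```

With this encoding, X + 1 is 0x3 and X^15 + X + 1 is 0x8003, so `poly_mul(0x3, 0x8003)` yields 0x18005, the CRC-16 generator. Likewise `poly_mul(0x3, 0xF01F)` yields 0x11021, the CRC-CCITT generator.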

The CRC generated by (1) has the following desirable
properties: 1) its maximum length is $2^{h-1} - 1$ bits, 2) its
burst-error-detecting capability is $b = h$, i.e., all error bursts
of length up to $h$ bits are detected, and 3) its minimum
distance is $d = 4$, i.e., all double errors and all odd numbers
of errors are detected. These properties are called the
guaranteed error-detecting capability. The CRC may detect
other errors, but this is not guaranteed, e.g., it can detect a large
percentage of error bursts of length greater than $h$ [2], [10],
[15]. An important problem, which is NP-hard and is not
addressed in this paper, is the determination of the
undetected error probability of a code [7].

General-purpose computers and compilers are increasingly
faster and more sophisticated. Software algorithms
are commonly used in operations, modeling, simulations,
and performance analysis of systems and networks. CRC
implementation in software is desirable, because many
computers do not have hardware circuits dedicated for
CRC computation. However, software implementation of
typical CRCs is slow, because it relies on many steps of the
polynomial division during CRC computation. It is this
speed limitation of CRCs that leads to the use of checksums
(which are fast and typically do not rely on table lookup) as
alternatives to CRCs in many high-speed networking
applications, although checksums are weaker than CRCs.
For example, the 16-bit ones-complement checksum is used
in Internet protocol and the Fletcher checksum is used in
ISO [5], [17]. There are also other fast error-detection codes
[3], [4], [12], [13], but they do not have all the desirable
properties of CRCs.

A  relevant  question  is  whether  there  is  a  new  family  of 
CRCs  that  are  faster  than  the  existing  CRCs.  In  this  paper, 
we  introduce  such  CRCs,  as  well  as  a  technique  for  their 
efficient  implementation.  For  these  fast  CRCs,  it  is  possible 
either  to  eliminate  or  to  greatly  reduce  many  steps  of  the 
polynomial  division  during  their  computation. 

A common existing technique for reducing the many steps
during CRC computation is to use table lookup, which
requires extra memory [9], [14], [15], [16]. In contrast, even
without table lookup, our fast CRCs require only a small
number of steps for their computation. Algorithms that do
not rely on table lookup have the advantage of being less
dependent on issues such as cache architecture and cache
misses. In particular, it is possible to use as few as 1.5 operations
per input message byte to encode our fast 64-bit CRC (which
is implemented in C and requires no table lookup).

The paper is organized as follows: In Section 2, we
review known facts about CRCs, which serve as the
background for our discussions. We present several
different algorithms for computing CRCs, some of which
are designed especially for our fast CRCs. In Section 3, we
identify the form of the generator polynomials for the fast
CRCs, and introduce a new technique for their implementation.
We then determine their guaranteed error-detecting
capability: the minimum distance, the burst-error-detecting
capability, and the maximum code length. In Section 4, we
discuss CRC software complexity and show that our fast
CRCs are typically faster than other CRCs. In Section 5, we
present summaries and extensions of the paper.

1.1  Notation  and  Convention 

In this paper, we consider polynomials that have binary
coefficients 0 and 1. Thus, all polynomial operations are
performed in the binary field GF(2), i.e., by using
polynomial arithmetic modulo 2. Let $A(X)$ and $M(X)$ be
two polynomials; then $R_{M(X)}[A(X)]$ denotes the remainder
polynomial that is obtained when $A(X)$ is divided by $M(X)$.
We must have $\mathrm{degree}(R_{M(X)}[A(X)]) < \mathrm{degree}(M(X))$.

An $s$-tuple denotes a block of $s$ bits $A = (a_{s-1},
a_{s-2}, \ldots, a_1, a_0)$, which is also represented by the binary
polynomial $a_{s-1}X^{s-1} + a_{s-2}X^{s-2} + \cdots + a_1 X + a_0$ of degree
less than $s$. We use the closely related notation $A(X)$ to denote
this polynomial, i.e., $A$ is composed of the binary coefficients
of $A(X)$. Thus, the tuple $A$ and the polynomial $A(X)$ are
equivalent and can be used interchangeably. Typically, the
polynomial notation is used to describe the mathematical
properties of codes, whereas the tuple notation is used to
describe the algorithmic properties (such as pseudocodes
and computer programs) of codes. If $Q_1(X)$ and $Q_2(X)$ are
an $s_1$-tuple and an $s_2$-tuple, respectively, then the $(s_1 + s_2)$-tuple
$(Q_1(X), Q_2(X))$ denotes the polynomial $Q_1(X)X^{s_2} + Q_2(X)$,
which is the concatenation of $Q_2(X)$ to $Q_1(X)$.

In this paper, we are interested in CRCs that have low
software complexity. Software complexity of an algorithm
refers to the number of operations (i.e., operation count)
used to implement the algorithm (whereas hardware complexity
refers to the number of gates used to implement the
algorithm). Suppose that we have two CRCs that operate
under similar environments and use similar types of
operations, but one CRC requires lower operation count
(e.g., having a smaller loop) than the other. It is likely that
the CRC with lower operation count (i.e., lower software
complexity) will result in faster encoding. Thus, complexity
correlates with speed. However, the amount of the correlation
also depends on many other complicating factors such
as memory speed, cache size, compiler, operating system,
pipelining, and CPU architecture. A CRC is called "fast" if it
has low software complexity and low memory requirement
(e.g., it requires no lookup table or only a small lookup
table). A CRC is called "faster" than another if, for a similar
level of memory requirement, it has lower software
complexity.

An algorithm (or implementation) is called bitwise if it does
not use table lookup. Note that a bitwise algorithm does not
necessarily involve only bit-by-bit manipulation or computation.
Fast checksums are typically bitwise. Bitwise algorithms,
which do not rely on table lookup, have the advantage
of being less dependent on issues such as cache architecture,
cache misses, and software code space. Ideally, fast CRC
algorithms should have low complexity and be bitwise. Thus,
unless explicitly stated, we focus on bitwise algorithms in this
paper. Table-lookup algorithms are presented in [19, Appendix A],
which can be found on the Computer Society Digital
Library at http://doi.ieeecomputersociety.org/10.1109/TC.2009.83.

The notation $(k, l, d)$ denotes a systematic code with $k$ =
the total bit length of the code, $l$ = the bit length of the input
message, and $d$ = the minimum distance of the code. The
burst-error-detecting capability of a code is denoted by $b$. To
facilitate cross-references, we label some blocks of text as
"Remarks," which are an integral part of the presentation and
should not be viewed as isolated observations or comments.

2  CRC  Algorithms 

In this section, we review some known facts about software
CRC implementation (e.g., see [2], [4], [6], [9], [12], [14], [15],
[16]). To lay a firm foundation for our later discussions, we
present these facts in more precise and general forms than
those often seen in the literature. Our presentation is a
straightforward generalization of the results in [15].

2.1  General  CRC  Theory 

Suppose that we use an $h$-bit CRC, generated by a
polynomial $M(X)$ of degree $h$, to protect an input
message $U(X)$, which has $l$ bits. By definition, the check
polynomial $P(X)$ is the remainder that is obtained by
dividing $U(X)X^h$ by $M(X)$, i.e., $P(X) = R_{M(X)}[U(X)X^h]$.
Because computers can process tuples of bits (e.g., bytes
or words) at a time, codes having efficient software
implementation should be encoded on tuples. Typical
modern processors can efficiently handle tuples of 8, 16,
32, and 64 bits.

Let $s > 0$ be any positive integer. We can write
$l = r + (n - 1)s$, for some $n > 0$ and $0 < r \le s$. We then
process the CRC by dividing the input message $U(X)$ into
$n$ tuples. The first tuple has $r$ bits, and all the other tuples have
$s$ bits. Because $r \le s$, we can then insert $(s - r)$ zeros to the left
of $U(X)$ to increase its length from $l$ to $l' = l + s - r = ns$,
without affecting the CRC computation, because
$R_{M(X)}[(0, 0, \ldots, 0, U(X))X^h] = R_{M(X)}[U(X)X^h] = P(X)$. That is, the
first tuple now also has $s$ bits, the $(s - r)$ left-hand bits of
which are always zeros.

Because each tuple $i$ has $s$ bits, it can be represented by a
polynomial $Q_i(X)$ of degree $< s$. Thus, the input message is
represented by $U(X) = (Q_0(X), Q_1(X), \ldots, Q_{n-1}(X))$. We
emphasize that, for given $h$ and $l$, we are free to choose the
value of $s$ (commonly chosen values are $s$ = 8, 16, 32, and
64 bits). As shown later, the choice of $s$ can have significant
impact on CRC speed.

Define $U_i(X) = (Q_0(X), Q_1(X), \ldots, Q_i(X))$ to be the first
$i + 1$ input tuples, i.e.,

$$\begin{aligned}
U_0(X) &= Q_0(X) \\
U_1(X) &= (Q_0(X), Q_1(X)) \\
&\;\;\vdots \\
U_{n-1}(X) &= (Q_0(X), Q_1(X), \ldots, Q_{n-1}(X)) = U(X).
\end{aligned}$$

Thus, for $i = 1, 2, \ldots, n - 1$, $U_i(X)$ is determined from
$U_{i-1}(X)$ and $Q_i(X)$ by

$$U_i(X) = (U_{i-1}(X), Q_i(X)) = U_{i-1}(X)X^s + Q_i(X). \tag{2}$$

For $i = 0, 1, \ldots, n - 1$, let $P_i(X)$ be the CRC check
polynomial for the partial input message $U_i(X)$, i.e.,

$$P_i(X) = R_{M(X)}[U_i(X)X^h]. \tag{3}$$

In particular, we have $P_0(X) = R_{M(X)}[Q_0(X)X^h]$, and

$$P_{n-1}(X) = R_{M(X)}[U_{n-1}(X)X^h] = R_{M(X)}[U(X)X^h] = P(X),$$

which is the CRC check polynomial for the entire input
message $U(X)$.
    1  B = 0;
    2  for (0 ≤ i < n)
    3  {
    4      A = B + Q_i X^(h-s);
    5      B = R_M[A X^s];
    6  }
    7  P = B;
    8  return P;

Fig. 1. CRC Algorithm 1 for computing the check h-tuple P from the input
s-tuples Q_0, ..., Q_(n-1) (s ≤ h).

Listing for CRC Algorithm 2 (Fig. 2):

    1  B = 0;
    2  for (0 ≤ i < n)
    3  {
    4      A = B X^(s-h) + Q_i;
    5      B = R_M[A X^h];
    6  }
    7  P = B;
    8  return P;

Substituting (2) into (3), we have

$$\begin{aligned}
P_i(X) &= R_{M(X)}[U_i(X)X^h] \\
&= R_{M(X)}[(U_{i-1}(X)X^s + Q_i(X))X^h] \\
&= R_{M(X)}[(U_{i-1}(X)X^h)X^s] + R_{M(X)}[Q_i(X)X^h].
\end{aligned}$$

Using (3), and the fact that replacing a factor by its remainder modulo
$M(X)$ does not change the remainder of the product, we then have

$$\begin{aligned}
P_i(X) &= R_{M(X)}[P_{i-1}(X)X^s] + R_{M(X)}[Q_i(X)X^h] \\
&= R_{M(X)}[P_{i-1}(X)X^s + Q_i(X)X^h],
\end{aligned} \tag{4}$$

for $i = 1, 2, \ldots, n - 1$. Note that (4) is a straightforward
generalization of a result in [15], which deals with the
special cases $h = 16$ and $s \in \{8, 16\}$. Thus, the check tuple
$P_i(X)$ is computed from $Q_i(X)$ and the previous check tuple
$P_{i-1}(X)$. Recall that $P_0(X) = R_{M(X)}[Q_0(X)X^h]$ and $P(X) =
P_{n-1}(X)$ is the CRC check tuple for $U(X)$. Using (4), $P(X)$ is
then computed via the following pseudocode:


    1  P = 0;
    2  for (0 ≤ i < n)
    3      P = R_M[P X^s + Q_i X^h];
    4  return P;
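As an illustration of the recurrence, the following C sketch instantiates it for h = 16 and s = 8 (one input byte per tuple). It is a minimal sketch under stated assumptions, not the paper's program: the names `poly_mod16` and `crc16_recurrence` are hypothetical, polynomials are stored as bit masks (bit k holds the coefficient of X^k), and the division loop is written for clarity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Remainder of y(X) modulo a generator m(X) of degree 16.
   Scans all possible high bits; not optimized. */
uint64_t poly_mod16(uint64_t y, uint64_t m)
{
    for (int k = 63; k >= 16; k--)
        if ((y >> k) & 1)
            y ^= m << (k - 16);   /* cancel the X^k term */
    return y;
}

/* P = R_M[P X^s + Q_i X^h] with h = 16, s = 8, starting from P = 0. */
uint16_t crc16_recurrence(const unsigned char *q, size_t n, uint32_t m)
{
    uint64_t p = 0;
    for (size_t i = 0; i < n; i++)
        p = poly_mod16((p << 8) ^ ((uint64_t)q[i] << 16), m);
    return (uint16_t)p;
}
```

With the CRC-CCITT generator written as the mask 0x11021 and a zero initial value, this computes the parameterization commonly catalogued as CRC-16/XMODEM, whose published check value for the ASCII string "123456789" is 0x31C3. Leading zero bytes leave P at zero, which matches the zero-padding argument made earlier in this section.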

Remark 1. We now review the computational complexity of
polynomial division, which is needed in CRC computation.
Given two polynomials $W(X)$ and $Y(X)$, let
$V(X) = R_{W(X)}[Y(X)]$ be the remainder polynomial that
is obtained when $Y(X)$ is divided by $W(X)$. Let $w$ and $y$
be the degrees of $W(X)$ and $Y(X)$, respectively. If $y < w$
(i.e., $y - w + 1 \le 0$), then $V(X) = Y(X)$, i.e., no polynomial
division is needed to obtain the remainder $V(X)$.
If $y \ge w$, we then need a polynomial division that
requires a loop of $y - w + 1$ iterations to obtain the
remainder $V(X)$ (see [8, p. 421]). To summarize, the
polynomial "long division" for computing $R_{W(X)}[Y(X)]$
requires a loop of $\max(0, y - w + 1)$ iterations.
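The iteration count in Remark 1 can be made concrete with a small C sketch of GF(2) long division. The function names and the bit-mask encoding (bit k holds the coefficient of X^k) are assumptions of this example; the loop visits one candidate quotient bit per iteration, so it runs max(0, y - w + 1) times.

```c
#include <stdint.h>

/* Degree of a GF(2) polynomial stored as a bit mask;
   returns -1 for the zero polynomial. */
int poly_degree(uint64_t p)
{
    int d = -1;
    while (p != 0) {
        p >>= 1;
        d++;
    }
    return d;
}

/* Long division of y(X) by w(X), with w != 0. Returns the remainder
   and stores the number of loop iterations in *iters. */
uint64_t poly_rem(uint64_t y, uint64_t w, int *iters)
{
    int dw = poly_degree(w);
    int count = 0;
    for (int k = poly_degree(y); k >= dw; k--) {
        if ((y >> k) & 1)
            y ^= w << (k - dw);   /* cancel the X^k term */
        count++;
    }
    *iters = count;
    return y;
}
```

For example, dividing X^4 + X + 1 (mask 0x13) by X^2 + 1 (mask 0x5) takes 4 - 2 + 1 = 3 iterations and leaves the remainder X (mask 0x2), while dividing X + 1 by X^2 + 1 takes no iterations at all.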

2.2  Two  CRC  Algorithms 

From (4), we have

$$P_i(X) = R_{M(X)}[(P_{i-1}(X) + Q_i(X)X^{h-s})X^s] \tag{5}$$

if $s \le h$, and

$$P_i(X) = R_{M(X)}[(P_{i-1}(X)X^{s-h} + Q_i(X))X^h] \tag{6}$$

if $s \ge h$. The CRC algorithms based on (5) and (6), called
Algorithms 1 and 2, are shown in Figs. 1 and 2, respectively.

Fig. 2. CRC Algorithm 2 for computing the check h-tuple P from the input
s-tuples Q_0, ..., Q_(n-1) (s ≥ h).
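A minimal C sketch of Algorithms 1 and 2 follows, for h = 16 with s = 8 (Algorithm 1) and s = 32 (Algorithm 2). The function names, the bit-mask encoding, and the choice of tuple sizes are assumptions of this example; the division helper is deliberately unoptimized.

```c
#include <stddef.h>
#include <stdint.h>

/* R_M[y] for a generator m of degree 16 (bit k = coefficient of X^k). */
uint64_t rem16(uint64_t y, uint64_t m)
{
    for (int k = 63; k >= 16; k--)
        if ((y >> k) & 1)
            y ^= m << (k - 16);
    return y;
}

/* Algorithm 1 with h = 16, s = 8: the input is n bytes. */
uint16_t crc16_alg1(const unsigned char *q, size_t n, uint32_t m)
{
    uint64_t b = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t a = b ^ ((uint64_t)q[i] << 8);   /* A = B + Q_i X^(h-s) */
        b = rem16(a << 8, m);                     /* B = R_M[A X^s]      */
    }
    return (uint16_t)b;
}

/* Algorithm 2 with h = 16, s = 32: the input is n 32-bit tuples. */
uint16_t crc16_alg2(const uint32_t *q, size_t n, uint32_t m)
{
    uint64_t b = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t a = (b << 16) ^ q[i];            /* A = B X^(s-h) + Q_i */
        b = rem16(a << 16, m);                    /* B = R_M[A X^h]      */
    }
    return (uint16_t)b;
}
```

Both functions return the same check tuple for the same bit string. When the message length is not a multiple of 32 bits, the first 32-bit tuple is padded with zeros on the left, as described in Section 2.1.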

2.3  Two  Alternative  CRC  Algorithms 

We now present two alternative CRC algorithms, which
will be applied to our fast CRCs (see Section 3).

Case 1: $s \le h$. The CRC check polynomial $P_j(X)$ for the
partial input message $U_j(X)$ can be split into two parts as

$$P_j(X) = (P_{j,1}(X), P_{j,2}(X)) = P_{j,1}(X)X^{h-s} + P_{j,2}(X), \tag{7}$$

where $P_{j,1}(X)$ and $P_{j,2}(X)$ are polynomials with
$\mathrm{degree}(P_{j,1}(X)) < s$ and $\mathrm{degree}(P_{j,2}(X)) < h - s$. That is,
$P_{j,1}(X)$ and $P_{j,2}(X)$ are the $s$ left-hand bits and $(h - s)$
right-hand bits of $P_j(X)$, respectively. Substituting (7) into
(4), we have

$$\begin{aligned}
P_i(X) &= R_{M(X)}[(P_{i-1,1}(X)X^{h-s} + P_{i-1,2}(X))X^s] + R_{M(X)}[Q_i(X)X^h] \\
&= R_{M(X)}[(P_{i-1,1}(X) + Q_i(X))X^h] + R_{M(X)}[P_{i-1,2}(X)X^s].
\end{aligned}$$

Because $\mathrm{degree}(P_{i-1,2}(X)X^s) < h = \mathrm{degree}(M(X))$, we have
$R_{M(X)}[P_{i-1,2}(X)X^s] = P_{i-1,2}(X)X^s$. Thus,

$$P_i(X) = R_{M(X)}[(P_{i-1,1}(X) + Q_i(X))X^h] + P_{i-1,2}(X)X^s. \tag{8}$$

The CRC algorithm based on (8), called Algorithm 3, is
shown in Fig. 3.

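Case 2: $s \ge h$. Multiplying both sides of (6) by $X^{s-h}$,
we have

$$\begin{aligned}
P_i(X)X^{s-h} &= (R_{M(X)}[(P_{i-1}(X)X^{s-h} + Q_i(X))X^h])X^{s-h} \\
&= R_{M(X)X^{s-h}}[(P_{i-1}(X)X^{s-h} + Q_i(X))X^h X^{s-h}] \\
&= R_{M(X)X^{s-h}}[(P_{i-1}(X)X^{s-h} + Q_i(X))X^s].
\end{aligned} \tag{9}$$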


    1   P = 0;
    2   for (0 ≤ i < n)
    3   {
    4       P1 = s left-hand bits of P;
    5       P2 = (h - s) right-hand bits of P;
    6       A = P1 + Q_i;
    7       B = R_M[A X^h];
    8       P = B + P2 X^s;
    9   }
    10  return P;

Fig. 3. CRC Algorithm 3 for computing the check h-tuple P from the input
s-tuples Q_0, ..., Q_(n-1) (s ≤ h).
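Algorithm 3 can be sketched in C for h = 16 and s = 8 as follows. This is an illustrative sketch only: the names `basic_b` and `crc16_alg3` are hypothetical, and B(X) is computed here by plain long division with a loop of s = 8 iterations.

```c
#include <stddef.h>
#include <stdint.h>

/* R_M[a X^16] for an 8-bit a and a generator m of degree 16,
   computed by long division with a loop of s = 8 iterations. */
uint16_t basic_b(uint8_t a, uint32_t m)
{
    uint32_t y = (uint32_t)a << 16;
    for (int k = 23; k >= 16; k--)
        if ((y >> k) & 1)
            y ^= m << (k - 16);
    return (uint16_t)y;
}

/* Algorithm 3 with h = 16, s = 8: the input is n bytes. */
uint16_t crc16_alg3(const unsigned char *q, size_t n, uint32_t m)
{
    uint16_t p = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t p1 = (uint8_t)(p >> 8);       /* s left-hand bits of P       */
        uint8_t p2 = (uint8_t)(p & 0xFF);     /* (h - s) right-hand bits     */
        uint8_t a = (uint8_t)(p1 ^ q[i]);     /* A = P1 + Q_i                */
        uint16_t b = basic_b(a, m);           /* B = R_M[A X^h]              */
        p = (uint16_t)(b ^ ((uint16_t)p2 << 8)); /* P = B + P2 X^s           */
    }
    return p;
}
```

Only the 8-bit value A enters the division, which is why this form is the usual starting point for byte-wise table lookup and, in this paper, for the fast CRCs.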




IEEE  TRANSACTIONS  ON  COMPUTERS,  VOL.  58,  NO.  10,  OCTOBER  2009 


    1  B = 0;
    2  for (0 ≤ i < n)
    3  {
    4      A = B + Q_i;
    5      B = R_N[A X^s];
    6  }
    7  P = h left-hand bits of B;
    8  return P;

Fig. 4. CRC Algorithm 4 for computing the check h-tuple P from the input
s-tuples Q_0, ..., Q_(n-1) (s ≥ h).

Define $L_i(X) = P_i(X)X^{s-h}$. From (9), we then have

$$L_i(X) = R_{N(X)}[(L_{i-1}(X) + Q_i(X))X^s], \tag{10}$$

where $N(X) = M(X)X^{s-h}$. Thus, $L_i(X)$ is computed from
$L_{i-1}(X)$ and $Q_i(X)$.

Note that $L_0(X) = P_0(X)X^{s-h}$, where $P_0(X) =
R_{M(X)}[Q_0(X)X^h]$. We then have

$$\begin{aligned}
L_0(X) &= (R_{M(X)}[Q_0(X)X^h])X^{s-h} \\
&= R_{M(X)X^{s-h}}[Q_0(X)X^h X^{s-h}] \\
&= R_{N(X)}[Q_0(X)X^s].
\end{aligned}$$

Because $L_i(X) = P_i(X)X^{s-h}$, the term $P_i(X)$ is obtained by
shifting $L_i(X)$ to the right by $(s - h)$ bits. Note that
$\mathrm{degree}(L_i(X)) < s$. We will show in Remark 2 that computing
$P_i(X)$ via (10) is slightly faster than via (6). The CRC algorithm
based on (10), called Algorithm 4, is shown in Fig. 4.

Remark 2. Suppose that $s > h$. The check polynomial $P(X) =
P_{n-1}(X) = R_{M(X)}[U_{n-1}(X)X^h]$ can then be computed by
Algorithm 2 (Fig. 2) or by Algorithm 4 (Fig. 4). We now
show that, for bitwise implementation, Algorithm 4 is
slightly faster than Algorithm 2. By comparing these two
algorithms, we observe the following. First, the computation
of $R_{M(X)}[A(X)X^h]$ (Fig. 2) and the computation of
$R_{N(X)}[A(X)X^s]$ (Fig. 4) have the same complexity, because
each requires $s$ iterations (by Remark 1). Next, the factor
$X^{s-h}$ at line 4 of Fig. 2 disappears from line 4 of Fig. 4.
Finally, one extra operation is required at line 7 of Fig. 4 to
extract the $h$ left-hand bits of the final $B(X)$. The above
observations imply that Algorithm 4 requires $n - 1$ fewer
operations than Algorithm 2. Thus, for bitwise implementation,
we will use Algorithm 4 when $s > h$.
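For comparison, here is a hedged C sketch of Algorithm 4 with h = 16 and s = 32. The name `crc16_alg4` and the fixed tuple size are assumptions of this example. The per-tuple shift of Algorithm 2 is absent; a single shift at the end extracts the h left-hand bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Algorithm 4 with h = 16, s = 32: the input is n 32-bit tuples and
   m is a generator of degree 16 (bit k = coefficient of X^k). */
uint16_t crc16_alg4(const uint32_t *q, size_t n, uint32_t m)
{
    const uint64_t big_n = (uint64_t)m << 16;   /* N(X) = M(X) X^(s-h), degree 32 */
    uint64_t b = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t y = (b ^ q[i]) << 32;          /* A X^s with A = B + Q_i */
        for (int k = 63; k >= 32; k--)          /* loop of s = 32 iterations */
            if ((y >> k) & 1)
                y ^= big_n << (k - 32);
        b = y;                                  /* B = R_N[A X^s], degree < 32 */
    }
    return (uint16_t)(b >> 16);                 /* P = h left-hand bits of B */
}
```

Because N(X) is a multiple of X^16, every intermediate B has its 16 low-order bits equal to zero, so the final right shift loses no information.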

2.4  Basic  CRC  Algorithms 

Given an input message $U(X)$ and a generator polynomial
$M(X)$ of degree $h$, Algorithms 1-4 produce the same CRC
check tuple $P(X)$. That is, they are four different ways of
accomplishing the same thing. The main difference among
these algorithms is how the input message is divided into
$s$-tuples $Q_i(X)$. Algorithms 1 and 3 are for $s \le h$, whereas
Algorithms 2 and 4 are for $s \ge h$. As shown later, CRC
speed depends on the choice of $s$. For flexibility, we allow
the possibility that the same CRC is used by computers that
have different architectures and capabilities. For example,
one computer can choose a value of $s$ for encoding a
message to transmit to another computer (with different
capabilities), which can choose a different value of $s$ for
detecting the errors in the received message.

The above CRC algorithms require polynomial divisions.
In particular, Algorithm 1 requires the polynomial division
$R_{M(X)}[A(X)X^s]$, Algorithms 2 and 3 require the polynomial
division $R_{M(X)}[A(X)X^h]$, and Algorithm 4 requires the
polynomial division $R_{N(X)}[A(X)X^s]$. To simplify the presentation,
we will use the single notation $B(X)$ to denote all
these polynomial divisions, i.e., we define

$$B(X) = \begin{cases}
R_{M(X)}[A(X)X^s] & \text{(Algorithm 1)}, \\
R_{M(X)}[A(X)X^h] & \text{(Algorithms 2 and 3)}, \\
R_{N(X)}[A(X)X^s] & \text{(Algorithm 4)},
\end{cases} \tag{11}$$

where $N(X) = M(X)X^{s-h}$. Note that $\mathrm{degree}(A(X)) < h$ in
Algorithm 1, and $\mathrm{degree}(A(X)) < s$ in Algorithms 2-4. As
seen in Figs. 1-4, CRC computation using any of the above
four algorithms requires the computation of $B(X)$
$n$ times.

A known technique for computing $B(X)$ is to use the
polynomial long division algorithm mentioned in Remark 1.
For example, consider Algorithms 2 and 3. We then have
$B(X) = R_{M(X)}[A(X)X^h]$, where $\mathrm{degree}(A(X)) < s$. Because
$\mathrm{degree}(A(X)X^h) \le s + h - 1$ and $\mathrm{degree}(M(X)) = h$, by
Remark 1, $B(X)$ can be computed via the polynomial long
division that requires a loop of $s$ iterations. Similarly, it can
be shown that computing $B(X)$ in Algorithms 1 and 4 also
requires a loop of $s$ iterations. That is, the computational
complexity for computing $B(X)$ is $O(s)$.

Definition  1.  The  technique  for  computing  the  polynomial  B(X) 
as  given  in  (11)  is  called  the  "basic"  technique.  Using  the 
polynomial  long  division,  B(X)  can  be  computed  in 
s  iterations.  An  algorithm  (or  a  CRC)  is  basic  if  it  uses  the 
basic  technique  for  computing  B(X). 

3 Fast CRCs

Recall that we are given an input message $U_{n-1}(X) =
(Q_0(X), Q_1(X), \ldots, Q_{n-1}(X))$, where $Q_i(X)$ is an $s$-tuple. We
protect this message by an $h$-bit CRC generated by a
polynomial $M(X)$ of degree $h$. The check $h$-tuple

$$P(X) = P_{n-1}(X) = R_{M(X)}[U_{n-1}(X)X^h]$$

can be computed by Algorithm 1 or 3 (if $s \le h$), or by
Algorithm 2 or 4 (if $s \ge h$). We emphasize that each of these
algorithms requires the calculation of $B(X)$ defined in (11),
which involves the polynomial division.

3.1  Fast  h-Bit  CRCs 

Our goal is to find some CRCs that have fast implementation,
i.e., to find a new family of generator polynomials
$M(X)$ for CRCs that have low complexity. Recall that the
CRC algorithms (Figs. 1-4) depend on the term $B(X)$.
Computation of $B(X)$ is also the most expensive step in the
algorithms. Thus, finding fast CRCs requires finding the
polynomials $M(X)$ that yield fast computation of $B(X)$.

The first technique for computing $B(X)$ is the basic
technique in Definition 1. Using the polynomial division,
we can compute $B(X)$ by a loop of $s$ iterations. In the
following, we present the second technique, called the new
technique, for computing $B(X)$. While the new technique is
applicable to any generator polynomial $M(X)$, it is more
effective for some special CRC generator polynomials,
called the fast polynomials. Recall that the basic CRCs can
use Algorithm 1 or 3 (if $s \le h$), or Algorithm 2 or 4 (if
$s \ge h$). However, as seen in the following, the fast CRCs use
only Algorithms 3 (for $s < h$) and 4 (for $s \ge h$) for their
bitwise implementation.

We now introduce a new family of CRCs, which are
generated by the following polynomials

$$F_h(X) = X^h + X^2 + X + 1, \tag{12}$$

for all $h \ge 4$. We ignore the case $h = 3$, which yields the
trivial repetition code {(0000), (1111)}. We call $F_h(X)$ the
"fast polynomial," which can be factored into

$$X^h + X^2 + X + 1 = (X + 1)G_{h-1}(X),$$

where

$$G_m(X) = X^m + X^{m-1} + \cdots + X^3 + X^2 + 1, \tag{13}$$

i.e., $G_m(X)$ includes all the terms except $X$. At first, it is not
clear why this particular polynomial $F_h(X)$ will speed up
the computation of $B(X)$. We now introduce a technique
that is applied to $F_h(X)$ to yield fast computation of $B(X)$.
By considering Algorithms 3 and 4, we have from (11)

$$B(X) = \begin{cases}
R_{M(X)}[A(X)X^h] & \text{if } s < h, \\
R_{N(X)}[A(X)X^s] & \text{if } s \ge h,
\end{cases} \tag{14}$$

where $N(X) = M(X)X^{s-h}$, and $A(X)$ is a polynomial of
degree less than $s$. We now transform $B(X)$ into a new form
that will be used by the fast CRCs. First, note that

$$\begin{aligned}
R_{M(X)}[A(X)(X^h + M(X))] &= R_{M(X)}[A(X)X^h] + R_{M(X)}[A(X)M(X)] \\
&= R_{M(X)}[A(X)X^h]
\end{aligned}$$

because $R_{M(X)}[A(X)M(X)] = 0$. Similarly, we have
$R_{N(X)}[A(X)(X^s + N(X))] = R_{N(X)}[A(X)X^s]$.
Thus, (14) becomes

$$B(X) = \begin{cases}
R_{M(X)}[A(X)(X^h + M(X))] & \text{if } s < h, \\
R_{N(X)}[A(X)(X^s + N(X))] & \text{if } s \ge h,
\end{cases} \tag{15}$$

where $N(X) = M(X)X^{s-h}$.

Definition 2. Using Algorithms 3 and 4, the technique (15) for
computing the polynomial $B(X)$ is called the "new"
technique. The CRC that is generated by the fast polynomial
$F_h(X) = X^h + X^2 + X + 1$ and uses the new technique for
computing $B(X)$ is called the fast $h$-bit CRC.

Theorem 1. Using Algorithms 3 and 4, the polynomial $B(X)$ for
the fast CRC generated by $F_h(X) = X^h + X^2 + X + 1$ is
given by

$$B(X) = \begin{cases}
R_{F_h(X)}[A(X)(X^2 + X + 1)] & \text{if } s < h, \\
R_{N(X)}[A(X)X^{s-h}(X^2 + X + 1)] & \text{if } s \ge h,
\end{cases} \tag{16}$$

where $N(X) = F_h(X)X^{s-h}$, and $A(X)$ is a polynomial of
degree less than $s$. Further, using the polynomial division,
$B(X)$ can be computed with $\max(0, s - h + 2)$ iterations.

Proof. Relation (16) follows by using (15) with $M(X) =
F_h(X)$ and $N(X) = F_h(X)X^{s-h}$, since $X^h + F_h(X) = X^2 + X + 1$
and $X^s + N(X) = X^{s-h}(X^2 + X + 1)$. First, suppose that
$s < h$. Then, $B(X) = R_{F_h(X)}[A(X)(X^2 + X + 1)]$. Because
$\mathrm{degree}(F_h(X)) = h$ and $\mathrm{degree}(A(X)(X^2 + X + 1)) <
s + 2$, from Remark 1, $B(X)$ can be computed with
$\max(0, s - h + 2)$ iterations. Next, suppose that $s \ge h$.
Then, $B(X) = R_{N(X)}[A(X)X^{s-h}(X^2 + X + 1)]$. Because
$\mathrm{degree}(N(X)) = s$ and $\mathrm{degree}(A(X)X^{s-h}(X^2 + X + 1)) <
2s - h + 2$, Remark 1 implies that $B(X)$ can also be
computed with $\max(0, s - h + 2)$ iterations. □

Let us briefly compare the computational complexity of
$B(X)$ for 1) the basic $h$-bit CRC generated by $M(X)$ and
2) the fast $h$-bit CRC generated by $F_h(X)$. For the basic
CRC, by Definition 1, $B(X)$ is computed via a loop of
$s$ iterations, regardless of the form of $M(X)$. However, for
the fast CRC, by Theorem 1, $B(X)$ is computed via a loop of
only $\max(0, s - h + 2)$ iterations. Thus, the fast CRC is
much faster than the basic CRC if $s$ is chosen such that
$s - h + 2$ is much smaller than $s$. Further, if $s + 1 < h$, then
$\max(0, s - h + 2) = 0$ and $B(X) = A(X)(X^2 + X + 1)$, i.e.,
the polynomial division is eliminated. Section 4 presents
CRC software complexity in more detail.

We emphasize that fast CRC denotes a CRC that meets
the following two conditions: 1) the CRC is generated by the
fast polynomial $F_h(X) = X^h + X^2 + X + 1$, and 2) the
polynomial $B(X)$ is computed via Theorem 1 by applying
the new technique (15) to $F_h(X)$. That is, fast CRC refers to a
CRC that is generated by a specific polynomial and is
implemented by a specific technique. Note that a CRC that
meets only one of the above two conditions may not have
any speed advantage over a basic CRC. For example,
suppose that, instead of the new technique (15), the basic
technique (in Definition 1) is applied to the CRC generated
by the fast polynomial $F_h(X)$. This CRC is then no different
from a basic CRC in terms of computational complexity.
Application of the new technique to polynomials other than
$F_h(X)$ is considered in [19, Appendix C].

To summarize, the fast $h$-bit CRC is generated by
$F_h(X) = X^h + X^2 + X + 1$. Under bitwise implementation,
the fast CRC uses Algorithm 3 if $s < h$ and Algorithm 4 if
$s \ge h$. The term $B(X)$ in these algorithms is given in
Theorem 1.
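Theorem 1 can be checked exhaustively for small parameters. The C sketch below is an illustration under stated assumptions: the names `rem_counted` and `theorem1_holds` are hypothetical, polynomials are bit masks (bit k holds the coefficient of X^k), and the parameters must satisfy 4 <= h and 2s <= 62 so that all intermediate values fit in 64 bits.

```c
#include <stdint.h>

/* Remainder of y(X) modulo w(X) of degree dw, where dy is an upper
   bound on the degree of y. Stores the loop count
   max(0, dy - dw + 1) in *iters. */
uint64_t rem_counted(uint64_t y, int dy, uint64_t w, int dw, int *iters)
{
    *iters = 0;
    for (int k = dy; k >= dw; k--) {
        if ((y >> k) & 1)
            y ^= w << (k - dw);
        (*iters)++;
    }
    return y;
}

/* Returns 1 if, for every s-bit A, the new form of B(X) equals the
   basic form, the basic form takes s iterations, and the new form
   takes max(0, s - h + 2) iterations. Returns 0 otherwise. */
int theorem1_holds(int h, int s)
{
    uint64_t f = (1ULL << h) | 0x7;            /* F_h = X^h + X^2 + X + 1 */
    int want = s - h + 2 > 0 ? s - h + 2 : 0;
    for (uint64_t a = 0; a < (1ULL << s); a++) {
        uint64_t t = a ^ (a << 1) ^ (a << 2);  /* A (X^2 + X + 1), degree <= s + 1 */
        int it_basic, it_new;
        uint64_t basic, fast;
        if (s < h) {
            basic = rem_counted(a << h, s + h - 1, f, h, &it_basic);
            fast  = rem_counted(t, s + 1, f, h, &it_new);
        } else {
            uint64_t nn = f << (s - h);        /* N = F_h X^(s-h), degree s */
            basic = rem_counted(a << s, 2 * s - 1, nn, s, &it_basic);
            fast  = rem_counted(t << (s - h), 2 * s - h + 1, nn, s, &it_new);
        }
        if (basic != fast || it_basic != s || it_new != want)
            return 0;
    }
    return 1;
}
```

For instance, `theorem1_holds(16, 8)` exercises the case in which the division disappears entirely, and `theorem1_holds(8, 12)` exercises the case s >= h, where s - h + 2 = 6 iterations replace 12.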


3.2  A  Fast  16-Bit  CRC 

We now consider the important case $h = 16$. Many CRCs
(as well as weaker checksums) used in practice have
16 check bits, e.g., the CRC-16 and CRC-CCITT mentioned
in Section 1. With a small amount of overhead, these CRCs
can have length up to $2^{15} - 1$ bits $\approx$ 4,096 bytes. Our goal
here is to present a concrete example of a new 16-bit CRC
that is not only much faster than but also as good as
existing 16-bit CRCs.

Our new 16-bit CRC is generated by

$$F_{16}(X) = X^{16} + X^2 + X + 1, \tag{17}$$

which can be factored into

$$F_{16}(X) = (X + 1)G_{15}(X),$$

where

$$G_{15}(X) = X^{15} + X^{14} + \cdots + X^3 + X^2 + 1.$$

It can be shown that $G_{15}(X)$ is a primitive polynomial, i.e.,
$F_{16}(X)$ is a product of $X + 1$ and a primitive polynomial
(however, as seen later, this is not true for many values of $h$).
Thus, this fast 16-bit CRC also has length up to $2^{15} - 1$ bits.
Although the polynomial (17) is different from the generator
polynomials for existing 16-bit CRCs, it does generate a CRC
that has the same guaranteed error-detecting capability as
existing 16-bit CRCs. From Theorem 1, we have

$$B(X) = \begin{cases}
R_{F_{16}(X)}[A(X)(X^2 + X + 1)] & \text{if } s < 16, \\
R_{N(X)}[A(X)X^{s-16}(X^2 + X + 1)] & \text{if } s \ge 16,
\end{cases}$$

where $N(X) = F_{16}(X)X^{s-16}$.

In the following, we consider two cases: $s = 8$ and
$s = 16$. First, assume that $s = 8$, i.e., the input message is
organized in 8-bit bytes. Because $s < 16$, we have $B(X) =
R_{F_{16}(X)}[A(X)(X^2 + X + 1)]$. Because $\mathrm{degree}(A(X)) < s = 8$,
we have $\mathrm{degree}(A(X)(X^2 + X + 1)) < 10$, which is smaller
than $\mathrm{degree}(F_{16}(X)) = 16$. From Remark 1, we have

$$B(X) = A(X)(X^2 + X + 1) = A(X)X^2 + A(X)X + A(X),$$

i.e., $B(X)$ is simply the sum of $A(X)$ and its translations.
Thus, computing $B(X)$ via the new technique requires no
polynomial division. In contrast, computing $B(X)$ via the
basic technique requires the polynomial division that has a
loop of $s = 8$ iterations (see Definition 1).
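The case s = 8 can be sketched in C as follows. This is not the paper's program (which appears in its Fig. 8); it is a minimal illustration in which `crc16_fast8` applies Algorithm 3 with B(X) computed as two shifts and two XORs, and `crc16_bitserial` is a hypothetical reference implementation used only for comparison. Both start from a zero check value, which is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: bit-serial CRC for a degree-16 generator m
   (bit k = coefficient of X^k), zero initial value. */
uint16_t crc16_bitserial(const unsigned char *q, size_t n, uint32_t m)
{
    uint32_t p = 0;
    for (size_t i = 0; i < n; i++) {
        p ^= (uint32_t)q[i] << 8;
        for (int j = 0; j < 8; j++) {
            p <<= 1;                  /* multiply by X */
            if (p & 0x10000u)
                p ^= m;               /* cancel X^16   */
        }
    }
    return (uint16_t)p;
}

/* Fast 16-bit CRC generated by F_16 = X^16 + X^2 + X + 1, with s = 8.
   No polynomial division is performed. */
uint16_t crc16_fast8(const unsigned char *q, size_t n)
{
    uint16_t p = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t a = (uint16_t)((p >> 8) ^ q[i]);          /* A = P1 + Q_i, 8 bits   */
        uint16_t b = (uint16_t)(a ^ (a << 1) ^ (a << 2));  /* B = A (X^2 + X + 1)    */
        p = (uint16_t)(b ^ (p << 8));                      /* P = B + P2 X^8         */
    }
    return p;
}
```

The cast to `uint16_t` in the last line of the loop discards the eight left-hand bits of P, so only P2 shifted by eight positions remains. For the one-byte message 0x01 the result is 0x0007, which is X^16 reduced modulo F_16.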

Next, assume that $s = h = 16$, i.e., the input message is
organized in 16-tuples. Because $s = 16$, we have
$\mathrm{degree}(A(X)) < 16$ and $\mathrm{degree}(A(X)(X^2 + X + 1)) < 18$.
Thus, by Remark 1, $B(X)$ is computed by polynomial
division that has a loop of two iterations. This contrasts
with computing $B(X)$ via the basic technique, which
requires a loop of $s = 16$ iterations (see Definition 1). Thus,
the loop iteration count of our new technique is less than
that of the basic technique by the factor of $16/2 = 8$.
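A corresponding C sketch for s = h = 16 is given below. The two conditional XORs are the two iterations of the polynomial division, written out instead of looped. The function names are hypothetical, the initial check value is assumed to be zero, and `crc16_ref16` is a plain shift-and-reduce reference included only for comparison.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: P = R_F16[(P + Q_i) X^16] by 16 shift-and-reduce steps. */
uint16_t crc16_ref16(const uint16_t *q, size_t n)
{
    uint32_t p = 0;
    for (size_t i = 0; i < n; i++) {
        p ^= q[i];
        for (int j = 0; j < 16; j++) {
            p <<= 1;
            if (p & 0x10000u)
                p ^= 0x10007u;        /* F_16 = X^16 + X^2 + X + 1 */
        }
    }
    return (uint16_t)p;
}

/* Fast 16-bit CRC with s = 16: the division has two iterations. */
uint16_t crc16_fast16(const uint16_t *q, size_t n)
{
    const uint32_t f16 = 0x10007u;
    uint16_t p = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t a = (uint32_t)(p ^ q[i]);       /* A = P + Q_i, degree < 16   */
        uint32_t t = a ^ (a << 1) ^ (a << 2);    /* A (X^2 + X + 1), degree < 18 */
        if (t & 0x20000u)
            t ^= f16 << 1;                       /* iteration 1: cancel X^17   */
        if (t & 0x10000u)
            t ^= f16;                            /* iteration 2: cancel X^16   */
        p = (uint16_t)t;
    }
    return p;
}
```

Cancelling X^17 changes only the terms X^3, X^2, and X, so it cannot disturb the X^16 term that the second step inspects. For the single tuple 0x8000 (that is, A = X^15) the result is X^15 + X^3 + 1, or 0x8009.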

To summarize, when the input message is organized in
$s$-tuples, it is possible to have a fast 16-bit CRC that requires
no polynomial division (when $s = 8$), or that requires the
polynomial division that has only two loop iterations (when
$s = 16$). Further, this fast 16-bit CRC has the same guaranteed
error-detecting capability as existing 16-bit CRCs.

When computing $B(X)$ via the new technique, although
the case $s = 16$ requires more loop iterations than the case
$s = 8$, we will see later in Section 4 that the case $s = 16$ has
lower overall computational complexity (i.e., lower overall
operation count per input byte). This is because, when
$s = 16$, there is no need to compute $P_{j,1}(X)$ and $P_{j,2}(X)$ as
defined in (7). Further, the overhead processing cost per
input byte when $s = 16$ is lower than when $s = 8$. The
C programs for the fast 16-bit CRC are shown in Fig. 8 and
in [19, Appendix A, Fig. 12].

3.3  Error-Detection  Capability  of  Fast  CRCs 

Recall that the h-bit CRC generated by M(X) given in (1) has minimum distance d = 4 if its total bit length is at most $2^{h-1} - 1$. We define the maximum length of an error-detection code to be the total bit length at or below which its minimum distance is $d \ge 3$, i.e., beyond which its minimum distance will reduce to d = 2. Thus, the maximum length of the h-bit CRC generated by (1) is $2^{h-1} - 1$ bits. In the following, we derive the maximum length of the fast CRCs, which is based on polynomial periodicity.

Fig. 5. The period of $F_h(X) = X^h + X^2 + X + 1$, which equals the maximum length of the fast h-bit CRC generated by $F_h(X)$.

By definition, the period of a polynomial G(X) is the smallest positive integer i such that $R_{G(X)}[X^i] = 1$. In particular, if G(X) is the product of X + 1 and a primitive polynomial of degree h - 1, then it can be shown that G(X) has period $2^{h-1} - 1$. Note that some polynomials, such as $X^2$, do not have periods.

The period of the fast polynomial $F_h(X) = X^h + X^2 + X + 1$ can be computed directly from definition (for small h) or from the technique in [1, Section 6.2]. The periods of $F_h(X)$, h > 4, are shown in Fig. 5. The following theorems, which are slight variations of well-known results from cyclic codes [10, Chapter 4], show that the maximum length of a CRC equals the period of its generator polynomial.
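For small h, computing the period "directly from definition" can be sketched as below. This is an illustrative brute-force search, not the technique of [1, Section 6.2]; the function name `poly_period` and its argument convention (low-order coefficients in `g_low`, with the X^h term implicit) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Period of G(X) = X^h + (low-order terms), for 2 <= h <= 32.
   g_low holds the coefficients of X^(h-1)..X^0 (the X^h term is implicit).
   Returns the smallest i > 0 with R_G[X^i] = 1, or 0 if no such i exists. */
uint64_t poly_period(unsigned h, uint32_t g_low)
{
    assert(h >= 2 && h <= 32);
    const uint64_t top   = UINT64_C(1) << (h - 1);
    const uint64_t mask  = (UINT64_C(1) << h) - 1;
    const uint64_t limit = UINT64_C(1) << h;    /* exceeds the number of nonzero residues */
    uint64_t r = 1;                             /* R_G[X^0] */

    for (uint64_t i = 1; i <= limit; i++) {
        /* r <- R_G[r * X]: one shift-and-reduce step */
        r = (r & top) ? (((r << 1) ^ g_low) & mask) : (r << 1);
        if (r == 1)
            return i;
    }
    return 0;                                   /* e.g., G(X) = X^2 has no period */
}
```

For example, `poly_period(8, 0x07)` and `poly_period(16, 0x0007)` evaluate the periods of $F_8(X)$ and $F_{16}(X)$, which the text states equal $2^{h-1} - 1$ for these two values of h.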

Theorem 2. Let C be a CRC generated by a polynomial M(X) of degree h > 3. Assume that X is not a factor of M(X). Let $n_b$ and d be the bit length and minimum distance of C, respectively. We then have

1. $d \ge 3$ if $n_b \le$ period of M(X).

2. d = 2 if $n_b >$ period of M(X).

3. C detects all error bursts of length up to h bits, i.e., b = h.

Proof. Let t be the period of M(X). We must have $t \ge h$. By definition, each codeword of C has the form

\[ V(X) = U(X)X^h + P(X), \]

where U(X) is the polynomial representing the input message, and P(X) is the check polynomial. Because $P(X) = R_{M(X)}[U(X)X^h]$, we have

\[ U(X)X^h = K(X)M(X) + P(X), \]

for some polynomial K(X). Thus, we have

\[ V(X) = U(X)X^h + P(X) = K(X)M(X), \]

i.e., C is a linear code. If d = 1, then $X^i = K(X)M(X)$ for some i. This implies that $M(X) = X^j$ for some j, which contradicts our assumption that X is not a factor of M(X). Thus, $d \ge 2$.

1. We now prove, by contradiction, the statement $d \ge 3$ if $n_b \le$ period of M(X). Thus, suppose that there is a codeword V(X) with length $n_b \le t$ and weight 2. Then, $V(X) = X^j + X^i$ for some i and j such that $n_b > j > i \ge 0$. Thus, $V(X) = X^i(X^{j-i} + 1)$. We also have V(X) = K(X)M(X) for some polynomial K(X). Because X is not a factor of M(X) by assumption, M(X) must divide $X^{j-i} + 1$, i.e., $R_{M(X)}[X^{j-i}] = 1$. Thus, $j - i \ge t$ = period of M(X). Then, $j \ge t \ge n_b$, which contradicts the condition $n_b > j$. Thus, all the codewords of length $n_b \le t$ must have weight at least 3, i.e., $d \ge 3$.

2. We construct a codeword with length > t and weight 2 as follows: Let $U(X) = X^{t-h}$. Then, $P(X) = R_{M(X)}[U(X)X^h] = R_{M(X)}[X^t]$. We have P(X) = 1 because t is the period of M(X). Thus, the codeword $V(X) = U(X)X^h + P(X) = X^t + 1$ has length t + 1 and weight 2. That is, d = 2 if $n_b > t$.

3. The fact that C detects all error bursts of length up to h bits (i.e., b = h) is well known [10]. □

Theorem 3. Let C be the CRC generated by the fast polynomial $F_h(X) = X^h + X^2 + X + 1$. Let $n_b$ and d be the bit length and minimum distance of C, respectively. We then have

1. d = 4 if $n_b \le$ period of $F_h(X)$.

2. d = 2 if $n_b >$ period of $F_h(X)$.

3. C detects all error bursts of length up to h bits, i.e., b = h.

Proof. Let t be the period of $F_h(X)$. From the proof of Theorem 2, every codeword of C has the form $V(X) = K(X)(X^h + X^2 + X + 1)$ for some polynomial K(X). Because $F_h(X)$ has an even number of terms, we have $F_h(1) = 0$, and hence V(1) = 0 for every codeword. Thus, the codewords of C have even weight, i.e., d is even.

Suppose now that the input message is U(X) = 1. Then, $P(X) = R_{F_h(X)}[X^h] = X^2 + X + 1$, which implies $V(X) = U(X)X^h + P(X) = X^h + X^2 + X + 1$. That is, the codeword V(X) has weight 4. Thus, d is either 2 or 4. From Theorem 2.1, we must have d = 4 if $n_b \le t$. From Theorem 2.2, we must have d = 2 if $n_b > t$. The fact that C detects all error bursts of length up to h bits is well known [10]. □

Recall that the maximum length of the h-bit CRC that is generated by M(X) given in (1), which is the product of X + 1 and a primitive polynomial of degree h - 1, is $2^{h-1} - 1$. Fig. 5 shows that the maximum length of the fast h-bit CRC is also $2^{h-1} - 1$ in many important cases, namely when h = 8, 16, 24, 48, 64, 128. In fact, $F_h(X) = X^h + X^2 + X + 1$ is also the product of X + 1 and a primitive polynomial at these values of h, i.e., the polynomial $G_{h-1}(X)$ in (13) is primitive when h = 8, 16, 24, 48, 64, 128.

Fig. 5 also shows that the maximum lengths of many fast h-bit CRCs are substantially less than the upper bound $2^{h-1} - 1$ (e.g., when h = 12 and h = 32). However, in [19, Appendix C], we apply our new technique to more general generator polynomials to yield other fast CRCs whose maximum lengths can approach the upper bound.

4  CRC  Software  Complexity 

We now analyze and compare CRC software complexity. Software complexity of an algorithm refers to the number of operations (i.e., operation count) used to implement the algorithm. Our goal in this paper is to compute the CRC check h-tuple P(X) for an input message that consists of n tuples $Q_0(X), Q_1(X), \ldots, Q_{n-1}(X)$. Each tuple $Q_i(X)$ has s bits. This CRC can be either a basic CRC generated by a polynomial M(X) of degree h, or the fast CRC generated by $F_h(X) = X^h + X^2 + X + 1$. For bitwise implementation, while Algorithm 1, 2, 3, or 4 can be used for the basic CRC, only Algorithm 3 or 4 is used for the fast CRC. The check tuple P(X) is computed by using a loop that computes B(X) n times, where B(X) is given in Definition 1 for the basic CRC and in Theorem 1 for the fast CRC.

In this section, we compute $e_b$ and $e_f$, which denote the software operation counts per input byte required for computing the check tuple P(X) for the basic CRC and the fast CRC, respectively. These operation counts will then be used to compare the complexity among our fast CRCs, the basic CRCs, the other fast CRCs in [4], and the block-parity checksum. An error-detection code is said to be "faster" than another if, for a similar level of memory requirement, it has lower software complexity.

4.1  General  Complexity  Analysis 

We  now  provide  the  complexity  analysis  for  the  important 
case  s  =  h  for  the  basic  CRC  and  the  fast  CRC  (other  cases 
can  be  analyzed  similarly).  Both  Algorithms  2  and  4  (shown 
in  Figs.  2  and  4)  then  reduce  to  Fig.  6.  Here,  we  have 

\[ B(X) = R_{M(X)}\left[A(X)X^s\right] \qquad (18) \]

for the basic CRC (see Definition 1), and

\[ B(X) = R_{F_h(X)}\left[A(X)(X^2 + X + 1)\right] \qquad (19) \]

for the fast CRC (see Theorem 1), where A(X) is a polynomial of degree less than s. Note that different CRC algorithms refer to different techniques for computing B(X). In particular, a CRC algorithm is called table lookup or bitwise, depending on whether the term B(X) in the algorithm is computed with or without table lookup. The bitwise technique is presented in this section. The table-lookup technique is presented in [19, Appendix A].

    1   B = 0;
    2   for (0 <= i < n)
    3   {
    4       A = B + Q_i;
    5       B = R_M[A X^s];               for basic CRC
            B = R_F[A (X^2 + X + 1)];     for fast CRC
    6   }
    7   P = B;
    8   return P;

Fig. 6. CRC algorithm (s = h).

Remark 3. The term $B(X) = R_{M(X)}[A(X)X^s]$ in (18) can be computed as follows: First, we write $A(X)X^s = (\cdots(A(X)X)\cdots)X$. Thus, B(X) can be computed in s iterations via the following pseudocode:

    1   for (0 <= j < s)
    2       A = R_M[AX];
    3   B = A;

where $R_M[AX]$ is computed by

\[
R_M[AX] =
\begin{cases}
AX + M & \text{if } \mathrm{msb}(A) = 1, \\
AX & \text{if } \mathrm{msb}(A) = 0,
\end{cases}
\qquad (20)
\]

where msb(A) denotes the most significant bit of A. The term $R_M[AX]$ in (20) can also be computed by using a table T[ ] of only two entries defined by T[0] = 0 and T[1] = M. We then have

\[ R_M[AX] = AX + T[\mathrm{msb}(A)]. \qquad (21) \]

Let u be the operation count required for computing $R_{M(X)}[A(X)X]$. Using Remark 3, the operation count required for computing B(X) in (18) for the basic CRC is then $s(u + l_s)$, where $l_s$ denotes the operation count for the loop overhead shown at line 1 of the pseudocode in Remark 3 (in particular, $l_s = 0$ if loop unrolling is used).

Let us now consider the term B(X) in (19) for the fast CRC. We have

\[
\begin{aligned}
B(X) &= R_{F_h(X)}\left[A(X)X^2\right] + R_{F_h(X)}\left[A(X)X\right] + A(X) \\
     &= R_{F_h(X)}\left[B_1(X)X\right] + B_1(X) + A(X),
\end{aligned}
\qquad (22)
\]

where $B_1(X) = R_{F_h(X)}[A(X)X]$, which has operation count u. After $B_1(X)$ is computed, $R_{F_h(X)}[B_1(X)X]$ also has operation count u. There are also two binary additions (i.e., two XOR operations) in (22). Thus, the operation count required for computing B(X) in (22) for the fast CRC is 2u + 2.

Let us now determine the total operation counts $t_b$ and $t_f$ for computing the check tuple P(X) for the basic CRC and the fast CRC, respectively. The CRC algorithm for computing P(X), which is shown in Fig. 6, has a loop of n iterations. In addition to the operation count for B(X), there is also one addition as indicated in line 4 of Fig. 6. Let $l_n$ be the operation count for the loop overhead shown at line 2 of Fig. 6. We then have

\[ t_b = n\left[l_n + 1 + s(u + l_s)\right], \qquad (23) \]

\[ t_f = n(l_n + 3 + 2u). \qquad (24) \]

The basic CRC and the fast CRC require $t_b$ and $t_f$ operations, respectively, to compute the check tuple P(X) for the input message that has ns bits, i.e., $t_b/(ns)$ and $t_f/(ns)$ operations are required per input bit. Recall that $e_b$ and $e_f$ denote the operation counts per input 8-bit byte required for computing the check tuple P(X), for the basic CRC and the fast CRC, respectively. We then have $e_b = 8t_b/(ns)$ and $e_f = 8t_f/(ns)$. Using (23) and (24), we have

\[ e_b = \frac{8t_b}{ns} = \frac{8\left[l_n + 1 + s(u + l_s)\right]}{s}, \qquad (25) \]

\[ e_f = \frac{8t_f}{ns} = \frac{8(l_n + 3 + 2u)}{s}, \qquad (26) \]

\[ \frac{e_b}{e_f} = \frac{t_b}{t_f} = \frac{l_n + 1 + s(u + l_s)}{l_n + 3 + 2u}. \qquad (27) \]

Simple estimates are $t_b \approx nsu$ [by ignoring $l_n + 1$ and $l_s$ in (23)] and $t_f \approx 2nu$ [by ignoring $l_n + 3$ in (24)]. Substituting these into (27), we have

\[ \frac{e_b}{e_f} = \frac{t_b}{t_f} \approx \frac{s}{2} = \frac{h}{2}, \qquad (28) \]

i.e., the fast CRC is approximately h/2 times faster than the basic CRC.
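The single-step reduction of Remark 3 can be sketched in C for a 16-bit register. This is only an illustration: the function names are hypothetical, and the generator is assumed to be the CRC-16 polynomial with its X^16 term dropped (0x8005), since that term falls off the register when A is shifted.

```c
#include <assert.h>
#include <stdint.h>

#define M_LOW 0x8005u   /* CRC-16: X^16 + X^15 + X^2 + 1, X^16 term implicit */

/* (20): R_M[AX] with an explicit test of msb(A). */
uint16_t rem_step_branch(uint16_t a)
{
    if (a & 0x8000u)
        return (uint16_t)((a << 1) ^ M_LOW);
    return (uint16_t)(a << 1);
}

/* (21): the same step with a two-entry table T[0] = 0, T[1] = M. */
uint16_t rem_step_table(uint16_t a)
{
    static const uint16_t T[2] = { 0x0000u, M_LOW };
    return (uint16_t)((a << 1) ^ T[a >> 15]);
}

/* Remark 3: B = R_M[A X^s], computed by s repeated single steps. */
uint16_t rem_shift_s(uint16_t a, int s)
{
    assert(s >= 0);
    for (int j = 0; j < s; j++)
        a = rem_step_table(a);
    return a;
}
```

The branch form and the table form should agree on every input; which one is cheaper in practice depends on the processor and compiler, which the operation-count model of this section does not capture.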


4.2  CRC  Complexity  Under  C  Implementation 

Figs. 7 and 8 show the C programs for the basic CRC and the fast CRC, respectively, which are based on Fig. 6 (s = h). For illustration, we let s = 16 in the figures, and $M(X) = X^{16} + X^{15} + X^2 + 1$ (which generates the CRC-16) in Fig. 7. However, the following results are also valid for other values of s and other generator polynomials.

We use the following two rules to count the number of software operations [19, Appendix A]: (R1) The operation count of a program statement is defined as the number of operations, other than the equal sign (=), that appear in that statement. (R2) For an if-statement, we average the operation count of the if-statement and the operation count of its alternative (e.g., an else-statement).

The nonzero operation count for each C program statement is recorded between the comment quotes (/* */). The programs show that $l_n = l_s = 2$. Using (20) of Remark 3, we have u = 3 if msb(A) = 0 and u = 4 if msb(A) = 1. Using rule (R2), we have u = 3.5 (which is the average of 3 and 4), as recorded in Figs. 7 and 8.
    unsigned short basic_CRC (int n, unsigned short *Q)
    {
        int i, j, s;
        unsigned short K, M, P;
        s = 16;
        M = 0x8005;                  /* M = X^16+X^15+X^2+1 */
        K = 0x8000;                  /* K = 2^(h-1), h=s=16 */
        P = 0;
        for (i=0; i<n; i=i+1)                        /* 2 */
        {
            P = P ^ Q[i];                            /* 1 */
            for (j=0; j<s; j=j+1)                    /* 2 */
            {
                if ( (P&K) != 0 ) P = (P<<1) ^ M;    /* 3.5 */
                else P = P<<1;
            }
        }
        return P;
    }

Fig. 7. C program for the basic CRC (s = h).

    unsigned short fast_CRC (int n, unsigned short *Q)
    {
        int i;
        unsigned short A, C, F, K, P;
        F = 0x7;                     /* F = X^16+X^2+X+1 */
        K = 0x8000;                  /* K = 2^(h-1), h=s=16 */
        P = 0;
        for (i=0; i<n; i=i+1)                        /* 2 */
        {
            A = P ^ Q[i];                            /* 1 */
            if ( (A&K) != 0 ) C = (A<<1) ^ F;        /* 3.5 */
            else C = A<<1;
            if ( (C&K) != 0 ) P = (C<<1) ^ F;        /* 3.5 */
            else P = C<<1;
            P = P ^ C ^ A;                           /* 2 */
        }
        return P;
    }

Fig. 8. C program for the fast CRC (s = h).

Substituting these values of $l_n$, $l_s$, and u into (25) and (26), we obtain $e_b = 8(3 + 5.5s)/s$ and $e_f = 96/s$. Thus, we have

\[ \frac{e_b}{e_f} = \frac{8(3 + 5.5h)}{96} = 0.25 + 0.458h, \qquad (29) \]

which is within 10 percent of (28). For example, let s = h = 16. Then, $e_b = 8(3 + 5.5 \times 16)/16 = 45.5$ and $e_f = 96/16 = 6$. Thus, $e_b/e_f = 45.5/6 = 7.58$, i.e., the fast CRC is 7.58 times "faster" than the basic CRC. Further, if s = h = 64, then $e_f = 1.50$, $e_b = 44.4$, and $e_b/e_f = 29.6$, i.e., the fast 64-bit CRC is 29.6 times faster than the basic 64-bit CRC. These results are recorded in Fig. 9.

                    s = 8                  s = 16                 s = 32                 s = 64
              e_b   e_f  e_b/e_f     e_b   e_f  e_b/e_f     e_b   e_f  e_b/e_f     e_b   e_f  e_b/e_f
    h = 8    47.0  12.0   3.92      45.5  28.0   1.62      44.8  36.0   1.24      44.4  40.0   1.11
    h = 16   48.0  10.0   4.80      45.5  6.00   7.58      44.8  25.0   1.79      44.4  34.5   1.29
    h = 32   48.0  10.0   4.80      46.0  5.00   9.20      44.8  3.00   14.9      44.4  23.5   1.89
    h = 64   48.0  10.0   4.80      46.0  5.00   9.20      45.0  2.50   18.0      44.4  1.50   29.6

Fig. 9. Software complexity for the basic CRCs ($e_b$) and the fast CRCs ($e_f$).

We now briefly present the complexity results for $s, h \in \{8, 16, 32, 64\}$, but without the restriction s = h. From (37) and (39) of [19, Appendix A], we have

\[
e_b =
\begin{cases}
8(4 + 5.5s)/s & \text{if } s < h, \\
8(3 + 5.5s)/s & \text{if } s \ge h,
\end{cases}
\qquad (30)
\]

and

\[
e_f =
\begin{cases}
80/s & \text{if } s < h - 1, \\
100/s & \text{if } s = h - 1, \\
96/s & \text{if } s = h, \\
8\left[12 + 5.5(s - h)\right]/s & \text{if } s > h.
\end{cases}
\qquad (31)
\]


As an example, consider a basic 16-bit CRC and the fast 16-bit CRC, which are used to protect an input message consisting of 8-bit bytes, i.e., h = 16 and s = 8. From the above formulas, we have $e_b = 8(4 + 5.5 \times 8)/8 = 48$ and $e_f = 80/8 = 10$. That is, the basic CRC and the fast CRC use 48 and 10 operations per input byte, respectively, to compute their check tuples. Thus, we have $e_b/e_f = 48/10 = 4.8$, i.e., the fast CRC is 4.8 times faster than the basic CRC. The values of $e_b$, $e_f$, and $e_b/e_f$ for various (h, s) pairs are recorded in Fig. 9. The results show that the complexity of the basic CRCs is rather insensitive to the values of h and s, namely $e_b$ varies from 44.4 to 48 (the variation is only 8.1 percent). In contrast, the complexity of the fast CRCs is very sensitive to the values of h and s, namely $e_f$ varies from 1.50 up to 40.0.
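The per-byte operation counts can be tabulated with a few lines of C. The sketch below simply evaluates the piecewise formulas (30) and (31), i.e., $e_b = 8(4 + 5.5s)/s$ for s < h and $8(3 + 5.5s)/s$ otherwise, and $e_f = 80/s$, $100/s$, $96/s$, or $8[12 + 5.5(s - h)]/s$ for s < h - 1, s = h - 1, s = h, and s > h; the function names `e_basic` and `e_fast` are hypothetical, and the constants assume $l_n = l_s = 2$ and u = 3.5 as in Figs. 7 and 8.

```c
#include <assert.h>

/* (30): operations per input byte for the basic CRC (l_n = l_s = 2, u = 3.5). */
double e_basic(int s, int h)
{
    assert(s > 0 && h > 0);
    return (s < h) ? 8.0 * (4.0 + 5.5 * s) / s
                   : 8.0 * (3.0 + 5.5 * s) / s;
}

/* (31): operations per input byte for the fast CRC. */
double e_fast(int s, int h)
{
    assert(s > 0 && h > 0);
    if (s < h - 1)  return 80.0 / s;
    if (s == h - 1) return 100.0 / s;
    if (s == h)     return 96.0 / s;
    return 8.0 * (12.0 + 5.5 * (s - h)) / s;
}
```

For instance, `e_basic(8, 16)` and `e_fast(8, 16)` give the two counts of the worked example (h = 16, s = 8), and looping s and h over {8, 16, 32, 64} regenerates the entries of Fig. 9 before rounding.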

For a given h, recall from Section 2.1 that we are free to choose the value of s. The complexity of the basic CRCs is rather insensitive to the choice of s. As seen in Fig. 9, when $h \in \{8, 16, 32, 64\}$, the complexity of the fast CRCs is fairly low when $s \le h$, and is minimized when s = h. When $h \notin \{8, 16, 32, 64\}$, it is shown in [19, Appendix A] that the complexity of the fast CRCs is minimized (i.e., $e_f$ is minimized) either at s = h or at s = h - 2.

To summarize, we introduce the new family of CRC generator polynomials that have the explicit form $F_h(X) = X^h + X^2 + X + 1$, for all h > 4, as well as the new technique (15) for their implementation. This family includes $F_8(X)$, which generates the ATM CRC-8. For this particular CRC, by choosing s = h = 8, our new technique provides a new bitwise implementation that is 3.92 times faster than the basic bitwise technique (see Fig. 9).
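As a hedged illustration of the s = h = 8 case, the following C sketch adapts the structure of Figs. 7 and 8 to $F_8(X) = X^8 + X^2 + X + 1$. It is not taken from the paper: the function names, the byte-array interface, and the zero initial value are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic bitwise CRC-8 for F8 = X^8 + X^2 + X + 1: 8 division steps per byte. */
uint8_t crc8_basic(const uint8_t *q, size_t n)
{
    assert(q != NULL || n == 0);
    uint8_t p = 0;
    for (size_t i = 0; i < n; i++) {
        p ^= q[i];
        for (int j = 0; j < 8; j++)
            p = (p & 0x80) ? (uint8_t)((p << 1) ^ 0x07) : (uint8_t)(p << 1);
    }
    return p;
}

/* Fast bitwise CRC-8 with s = h = 8: two division steps per byte. */
uint8_t crc8_fast(const uint8_t *q, size_t n)
{
    assert(q != NULL || n == 0);
    uint8_t p = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t a = (uint8_t)(p ^ q[i]);
        /* c = R_F[A X] */
        uint8_t c = (a & 0x80) ? (uint8_t)((a << 1) ^ 0x07) : (uint8_t)(a << 1);
        /* d = R_F[c X] = R_F[A X^2] */
        uint8_t d = (c & 0x80) ? (uint8_t)((c << 1) ^ 0x07) : (uint8_t)(c << 1);
        p = (uint8_t)(d ^ c ^ a);          /* A X^2 + A X + A, reduced */
    }
    return p;
}
```

Both functions are intended to return the same check byte for any message; the fast version replaces the inner loop of eight steps with two steps and two XORs.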

Remark 4. There exist well-known techniques for reducing the operation counts used in CRC implementation. An example is the use of table lookup (at the cost of increased memory and cache usage), which is presented in [19, Appendix A]. Note that, to keep our C programs compact, readable, and general, we ignore software optimization techniques (such as loop unrolling) in our C programs. However, these techniques certainly can be used to reduce the operation counts in the programs. For example, if loop unrolling is used (at the cost of code size expansion) in the inner for-loop of the C program in Fig. 7, then the index increment and the end-of-loop test are eliminated, i.e., the loop overhead $l_s$ is reduced from $l_s = 2$ to $l_s = 0$. With loop unrolling (i.e., $l_s = 0$), it can be shown that (30) and (31) reduce to

\[
e_b =
\begin{cases}
8(4 + 3.5s)/s & \text{if } s < h, \\
8(3 + 3.5s)/s & \text{if } s \ge h,
\end{cases}
\]

\[
e_f =
\begin{cases}
80/s & \text{if } s < h - 1, \\
100/s & \text{if } s = h - 1, \\
96/s & \text{if } s = h, \\
8\left[12 + 3.5(s - h)\right]/s & \text{if } s > h.
\end{cases}
\]

4.3  Other  Techniques  for  Error-Detection  Codes 

The complexity results for the basic CRC algorithm, which are rather insensitive to the input parameters s, h, and the form of the generator polynomial M(X), are shown in Fig. 9. In particular, when h = 16 and s = 8, we have $e_b = 48$ operations per input byte. Our CRC software implementation in C for this case is shown in [19, Appendix A, Fig. 11]; it is more efficient than the one given in [2, pp. 555-556], which has 63 operations per input byte according to rules (R1) and (R2).

There are other CRC algorithms that are much faster than the basic algorithm. As expected, those algorithms are effective for some particular generator polynomials. For example, the clever "add and shift" algorithm of [4] is fast for the CRCs generated by $M_1(X) = X^{32} + X^{31} + X^8 + 1$ (for h = 32) and $M_2(X) = X^{64} + X^{63} + X^2 + 1$ (for h = 64), which are found by computer search [4]. According to rules (R1) and (R2) for determining the operation counts, these CRCs use 20 operations to process each tuple of s = 32 input bits (see [4, Fig. 2]). Thus, these CRCs use five operations per input byte. In contrast, from Fig. 9, for s = 32, our fast CRCs use only 3 and 2.5 operations per input byte for h = 32 and h = 64, respectively. Thus, our fast CRCs are faster than the above shift-and-add CRCs. Further, our fast 64-bit CRC is faster still when s = 64, because it uses only 1.5 operations per input byte (see Fig. 9).

As mentioned in Section 1, alternatives to CRCs are checksums. Although checksums are weaker than CRCs, they can be substantially faster than CRCs. For example, let s = h and consider the block-parity checksum. The check tuple P(X) of this checksum is simply the sum of all the input tuples, i.e., $P(X) = \sum_{i=0}^{n-1} Q_i(X)$. As shown in [19, Section B.1], the operation count per input byte required for computing P(X) of the checksum is e = 24/s. From (31), the fast CRC has $e_f = 96/s$. Thus, $e_f/e = 96/24 = 4$, i.e., the checksum is four times faster than the fast CRC.
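For comparison, the block-parity checksum can be sketched in C for s = h = 16. The function name `block_parity` and its interface are assumptions for this illustration; the sum of tuples is polynomial addition over GF(2), i.e., bitwise XOR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Block-parity checksum for s = h = 16: P = Q_0 + Q_1 + ... + Q_{n-1},
   where "+" is polynomial addition over GF(2), i.e., bitwise XOR. */
uint16_t block_parity(const uint16_t *q, size_t n)
{
    assert(q != NULL || n == 0);
    uint16_t p = 0;
    for (size_t i = 0; i < n; i++)
        p ^= q[i];
    return p;
}
```

Under rule (R1), each loop iteration of this sketch costs the loop overhead of 2 plus one XOR, i.e., 3 operations per s-bit tuple, which is consistent with the stated e = 24/s operations per input byte.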

5  Summary  and  Extension 

Error control coding is essential for reliable transmission and storage, and CRCs are known to be effective for error detection. In software, an h-bit CRC is typically implemented by dividing the input message into s-tuples (i.e., blocks of s bits). The output CRC check bits are obtained by recursively carrying out the polynomial division on these tuples.

Thus, the crucial part in CRC computation is the polynomial division on s-tuples. For the basic CRCs, this division requires s iterations, which may be expensive for many applications. A common technique for reducing the many steps during CRC computation is to use additional memory in the form of table lookup. In this paper, we introduce the fast h-bit CRCs, which are generated by $F_h(X) = X^h + X^2 + X + 1$, as well as the new technique (15) to implement them. Using our fast CRCs, the polynomial division on s-tuples requires only max(0, s - h + 2) iterations, which is far fewer than the s iterations required for the basic CRCs, as long as s is chosen such that s - h + 2 is much smaller than s. We study the computational complexity of the CRCs, which refers to the operation count per input byte required for computing the CRC check tuples. Our fast CRCs have low complexity and require no table lookup. For the important case s = h, the fast h-bit CRCs are approximately h/2 times faster than the basic h-bit CRCs.

As an illustration, we implement the CRCs in the C programming language, and then study their computational complexity for the bitwise technique (i.e., without table lookup). We show that the complexity of the fast h-bit CRCs varies greatly with s, and is minimized either at s = h - 2 or at s = h. In contrast, the complexity of the basic h-bit CRCs varies little with s. Because modern computers typically process information in bytes or words, we also present the complexity results when s is restricted to multiples of byte size and word size.

In [19], we provide several extensions to the baseline ideas presented in this paper. In particular, we present the results for CRC table-lookup techniques, which illustrate tradeoffs between computational complexity and memory requirement. We show that when s = h, the fast CRCs can be made 20 percent faster by using tables of only four entries. We apply our new technique to some weaker CRCs to yield even faster CRCs, i.e., there are tradeoffs between speed and capability. Further, we use the new technique to construct some fast extended Hamming perfect codes. In particular, we construct h-bit non-CRC codes that not only have low complexity but also have the following optimal properties. They have the minimum distance d = 4, the burst-error-detecting capability b = h, and the maximum code length $2^{h-1}$. We also apply the new technique to arbitrary CRCs, and then determine the conditions under which the new technique remains effective. In particular, the new technique is substantially faster than the basic technique for the CRC-64-ISO generated by $X^{64} + X^4 + X^3 + X + 1$. Finally, we show how the CRC algorithms, which are originally designed for sequential implementation on a single processor, can be adapted for parallel implementation on multiple processors.
APPENDIX A  CRC SOFTWARE IMPLEMENTATION AND COMPLEXITY EVALUATION

The purpose of this appendix is to present software implementation for the CRC algorithms as well as to evaluate their computational complexity. Software complexity of an algorithm refers to the number of operations (i.e., operation count) used to implement the algorithm. Consider an h-bit CRC, which is generated by a polynomial M(X) of degree h. Our goal is to compute the check h-tuple P(X) for an input message that consists of n tuples $Q_0(X), Q_1(X), \ldots, Q_{n-1}(X)$. Each tuple $Q_i(X)$ has s bits.

The CRC can be implemented by any of the 4 algorithms shown in Figs. 1-4. Although the value of h is fixed, we are free to choose the value of s. Algorithms 1 and 3 are for s < h, whereas Algorithms 2 and 4 are for $s \ge h$. One algorithm can be faster than another, depending on the value of s and the form of M(X). For example, Remark 2 shows that, for bitwise implementation, Algorithm 4 is faster than Algorithm 2 when $s \ge h$. Thus, we will use Algorithm 4 for bitwise implementation when $s \ge h$, as indicated in Fig. 10. As stated in Theorem 1, Algorithm 3 must be used for the fast CRCs when s < h. Fig. 10 lists the CRC algorithms that are used in our software implementation.

We recognize that accurate software evaluation is complicated, and requires experiments with different processors, memory organizations, programming languages, and compilers. Other complicating factors include programming styles and the extent to which the CRCs must share with (or compete against) other concurrent/interrupting programs.

Instead  of  dealing  with  the  complex  issues  mentioned  above,  which  are  beyond  the  scope  of  this  paper,  we 
simply  use  software  operation  counts  for  our  complexity  evaluation.  Our  technique  of  software  comparison  is 
as  follows.  We  write  a  program  (e.g.,  in  C)  for  each  CRC.  We  then  use  the  operation  count  as  the  primary 
measure  of  complexity,  and  a  CRC  is  said  to  be  “faster”  than  another  if  it  has  lower  operation  count. 

We now determine the software complexity of the CRC algorithms, which refers to the operation count per input message byte required for computing the check h-tuple. Let us examine Algorithms 1-4 (shown in Figs. 1-4). For each algorithm, the check tuple P(X) is computed by using a loop that computes B(X) n times, where B(X) is given in Definition 1 for the basic CRC and in Theorem 1 for the fast CRC. In addition to B(X), we also need to compute all the other terms inside the loop (which include the loop overhead). Let r and x be the operation counts required for computing B(X) and the other terms inside the loop, respectively. Let y be the operation count required for computing the terms outside the loop. Further, for each algorithm, let t be the total operation count required for computing the CRC check tuple from the input message that consists of n tuples. We then have t = (x + r)n + y.

Let e be the operation count per input byte required for computing the check h-tuple. Each byte has 8 bits. Because t is the operation count for computing the check h-tuple from the ns input message bits, we have e = 8t/(ns) = 8[(x + r)n + y]/(ns), i.e.,

\[ e = \frac{8(x + r)}{s} + \frac{8y}{ns}. \qquad (32) \]


In the following, we consider h, s, and n to be independent variables, and our goal is to compute e in terms of h, s, and n for both the basic CRCs and the fast CRCs. That is, we can write e = e(s, h, n). To compute e, we need to determine r, x, and y, to which we add the subscripts b and f when they refer to the basic CRCs and the fast CRCs, respectively. That is, $r_b$, $x_b$, $y_b$, and $e_b$ refer to the basic CRCs, while $r_f$, $x_f$, $y_f$, and $e_f$ refer to the fast CRCs.

We present CRC implementation with and without table lookup. Our software programs are for w-bit computers that satisfy $s \le w$ and $h \le w$ (however, we allow the possibility that h + s > w). For example, 32-bit computers are for $s, h \le 32$ bits, while 64-bit computers are for $s, h \le 64$ bits (future 128-bit computers are for $s, h \le 128$ bits). To be specific, we implement the CRC algorithms in C, which is a highly portable general-purpose computer programming language (certainly, they can also be implemented in other computer languages). We use the following 2 simple rules to count the number of software operations [13]:

(R1) The operation count of a program statement is defined as the number of operations, other than the equal sign (=), that appear in that statement.

(R2) For an if-statement, we average the operation count of the if-statement and the operation count of its alternative (e.g., an else-statement).
Let us consider examples of how to use rule (R1). The statement C = (A<<1)^F counts as 2
operations (<< and ^). Note that "=" does not count as an operation. Next, consider the statement
for(i=0; i<n; i=i+1){ }. This implements a (null) loop of n iterations, where each iteration has 2 operations
(< and +). Thus, the total operation count for this loop statement is 2n. The for-loop above is equivalent
to the while-loop i=0; while (i<n) {i=i+1;} which, of course, also has 2n operations.

We now show examples of rule (R2). Suppose that K = 1, and consider the following 2 statements:

    if ((A&K) != 0) C = (A<<1)^F;
    else C = A<<1;

Here, the if-statement has 4 operations (&, !=, <<, ^), and the else-statement has 3 operations (&, !=, <<).
Thus, the above 2 statements can be considered as a single statement that has 3.5 operations (i.e., the average
of 4 and 3).

The above 2 statements are equivalent to the following 2 statements:

    C = A<<1;
    if ((A&K) != 0) C = C^F;

Here, the first statement has 1 operation, and the second statement has 2.5 operations. Thus, the 2 statements
together also have 3.5 operations, as expected. Note that (A&K) ∈ {0, 1}, because K = 1. Here, for simplicity,
we assume that (A&K) takes the values 0 and 1 with equal probability of 1/2. Suppose now that K = 3. We
then have (A&K) ∈ {0, 1, 2, 3}. By assuming that (A&K) takes the values 0, 1, 2, and 3 with equal probability
of 1/4, the above if-statement (which has 4 operations) is executed with probability 3/4 and the else-statement
(which has 3 operations) is executed with probability 1/4. Thus, these if-else statements can be
considered as a single statement that has 4 x 3/4 + 3 x 1/4 = 3.75 operations.
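The weighted average used in the K = 3 example can be sketched as a small C helper. This is only an illustration: the function names are hypothetical, and the sketch assumes that the bits of A selected by K are independent and equally likely to be 0 or 1, as in the text above.

```c
#include <assert.h>

/* Number of 1 bits in the mask K. */
static int popcount_u(unsigned K)
{
    int c = 0;
    while (K != 0) { c += (int)(K & 1u); K >>= 1; }
    return c;
}

/* Expected operation count of
 *     if ((A&K) != 0) <if-branch> else <else-branch>
 * under rule (R2), assuming the bits of A selected by K are independent
 * and uniformly distributed (an assumption of this sketch).
 * ops_if and ops_else each include the cost of evaluating the test. */
double expected_if_else_ops(unsigned K, double ops_if, double ops_else)
{
    int bits = popcount_u(K);
    double p_zero;
    assert(bits < 31);
    p_zero = 1.0 / (double)(1u << bits);   /* P[(A&K) == 0] */
    return ops_if * (1.0 - p_zero) + ops_else * p_zero;
}
```

With ops_if = 4 and ops_else = 3, the helper gives 3.5 for K = 1 and 3.75 for K = 3, matching the two counts derived above.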


                bitwise              table lookup

basic CRC       Algo. 1 (s < h)      Algo. 3 (s < h)
                Algo. 4 (s ≥ h)      Algo. 2 (s ≥ h)

fast CRC        Algo. 3 (s < h)      Algo. 3 (s < h)
                Algo. 4 (s ≥ h)      Algo. 2 (s ≥ h)

Fig. 10  CRC algorithms used in software implementation

Remark 5. Rules (R1) and (R2) serve as a simple technique for comparing the complexity of different CRCs,
i.e., they will be used to obtain a first-order estimation of the ratio e_b/e_f. These rules are intended only for
CRC algorithms that are implemented in C, and not for other types of algorithms or other programming
languages. As seen in the following, our CRC software implementation uses only a small number of elementary
C operators (namely, +, <<, >>, =, ==, !=, <, <=, &, and ^) and C keywords (namely, char, short, int,
long, unsigned, if, else, for, while, and return). Our following C programs for the CRCs are written in a
style that is intended to be simple and straightforward. See also Remark 6.

Other techniques for counting operations are also possible. For example, consider rule (R1'), which is
defined as rule (R1), but also counts the equal sign (=) as an operation. Let e'_b and e'_f denote the resulting
operation counts under (R1'). We must have e'_b > e_b and e'_f > e_f. Although the difference between e_b
and e'_b (as well as between e_f and e'_f) can be significant, the difference between the ratios e_b/e_f and e'_b/e'_f
is typically not significant. For example, let s = h = 32. From Fig. 9, we have e_b = 44.8, e_f = 3,
and e_b/e_f = 14.9. Under rule (R1'), it can be shown that e'_b = 61.5, e'_f = 4.25, and e'_b/e'_f = 14.5, i.e.,
e_b/e_f ≈ e'_b/e'_f. Note that rule (R1), which is used in this paper, is slightly simpler to use than rule (R1').
Thus, our technique for counting software operations is reasonable for the purpose of complexity comparison,
i.e., we are more interested in the ratio e_b/e_f than in e_b and e_f themselves.

Here,  for  simplicity,  we  assign  the  same  unit  cost  to  each  operation.  A  more  elaborate  technique  would 
assign  different  costs  to  different  operations.  However,  this  assignment  depends  on  many  factors  (such  as 
computer  hardware,  operating  system,  processor  architecture,  and  memory  organization),  which  are  outside 
the  focus  of  this  paper.  □ 

Let us now compute x and y in (32). The computation of r is deferred to later subsections. First,
consider Fig. 11, which shows the C program for bitwise implementation of the basic CRC for the case
s < h. As indicated in Fig. 10, this program is based on Algorithm 1. In this program, we assume that
h ∈ {8, 16, 32, 64}, i.e., h is the size (in bits) of one of the natural unsigned types of C: unsigned char, unsigned
short int, unsigned int, or unsigned long int. The input is the n message s-tuples Q[0], Q[1], ..., Q[n − 1],
and the output is the CRC check h-tuple P.

We then apply rules (R1) and (R2) to the program shown in Fig. 11 to obtain the desired operation
counts. The non-zero operation count for each program statement is recorded between the comment quotes
(/* */). Recall that the total operation count for computing the check tuple P from the n input tuples is
t = (x_b + r_b)n + y_b. Here, r_b is the operation count required for computing B(X), which is inside the loop
indexed by i, x_b is the operation count required for computing all the other terms in the loop besides B(X),
and y_b is the operation count required for computing all the terms outside the loop. From Fig. 11, we have
x_b = 4 and y_b = 0.

To summarize, for h ∈ {8, 16, 32, 64} and s < h, we have x_b = 4 and y_b = 0, which are recorded in Fig. 13.
For illustration, we let h = 16, s = 8 and M(X) = X^16 + X^15 + X^2 + 1 (which generates the well-known
CRC-16) in Fig. 11. When h ∉ {8, 16, 32, 64} and s < h, the computational complexity is slightly higher,
namely, x_b = 4 and y_b = 1.

Fig. 7 shows the C program for the basic CRC when s = h. Next, consider Fig. 12, which shows the C
program for the fast CRC when h ∈ {8, 16, 32, 64} and s < h − 1. It can be shown that x_f = 6 and y_f = 0
for this case. Again, for illustration, we let h = 16 and s = 8 in Fig. 12. Similarly, we can compute the
values of x and y for all the cases for both the basic CRCs and the fast CRCs. The results are summarized
in Fig. 13.

Using Fig. 13, the expression (32) can be simplified as follows. Let z be the ratio of the 2 terms on the
right-hand side of (32), i.e.,

\[ z = \frac{8(x + r)/s}{8y/(ns)} = \frac{(x + r)n}{y} \]

From Fig. 13, we have 0 ≤ y ≤ 1 and x ≥ 3, which implies that z ≥ (x + r)n ≥ (3 + r)n ≥ 3n. Thus,
8y/(ns) is much smaller than 8(x + r)/s, because we assume in this paper that n is not too small (i.e., we
assume that n > 4). Thus, the term 8y/(ns) can be dropped from (32). The operation count per input byte
required for computing the CRC check h-tuple then simplifies to

\[ e = \frac{8(x + r)}{s} \tag{33} \]


where x is determined from Fig. 13, which depends only on s and h, i.e., x = x(s, h). Recall that r denotes the
operation count required for computing B(X), which also depends only on s and h [see (11)], i.e., r = r(s, h).
It follows from (33) that e now also depends only on s and h, i.e., e = e(s, h). From (33), we also have

\[ \frac{e_b}{e_f} = \frac{x_b + r_b}{x_f + r_f} \]

where x_b and x_f are given in Fig. 13. In the following, using rules (R1) and (R2), we compute r_b and r_f for
both the bitwise and the table-lookup techniques.
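As a numerical companion to (32) and (33), the following sketch evaluates the exact and the simplified operation counts per input byte. The function names are hypothetical; the values passed for x, r, and y are inputs to be taken from Fig. 13 and from the later subsections.

```c
#include <assert.h>

/* Exact operation count per input byte, as in (32). */
double ops_per_byte_exact(double x, double r, double y, int s, int n)
{
    assert(s > 0 && n > 0);
    return 8.0 * (x + r) / s + 8.0 * y / ((double)n * s);
}

/* Simplified operation count per input byte, as in (33):
 * the term 8y/(ns) is dropped. */
double ops_per_byte(double x, double r, int s)
{
    assert(s > 0);
    return 8.0 * (x + r) / s;
}
```

For instance, with x = 4, r = 44, and s = 8, the simplified count is 48 operations per byte; adding y = 1 with n = 4 raises the exact count only to 48.25, which illustrates why the second term is neglected.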


    unsigned short basic_CRC (int n, unsigned char *Q)
    {
        int i, j, hs, s;
        unsigned short A, B, M, K, P;
        s = 8;
        hs = 8;                                     /* hs = h-s */
        M = 0x8005;                                 /* M = X^16+X^15+X^2+1 */
        K = 0x8000;                                 /* K = 2^(h-1), h=16 */
        P = 0;
        for (i=0; i<n; i=i+1)                       /* 2 */
        {
            P = P ^ (Q[i] << hs);                   /* 2 */
            for (j=0; j<s; j=j+1)                   /* 2 */
            {
                if ( (P&K) != 0 ) P = (P<<1) ^ M;   /* 3.5 */
                else P = P<<1;
            }
        }
        return P;
    }

Fig. 11  C program for the basic h-bit CRC (s < h)
    unsigned short fast_CRC (int n, unsigned char *Q)
    {
        int i, hs, s;
        unsigned short A, B, P;
        s = 8;
        hs = 8;                            /* hs = h-s, h = 16 */
        P = 0;
        for (i=0; i<n; i=i+1)              /* 2 */
        {
            A = (P>>hs) ^ Q[i];            /* 2 */
            B = (A<<2) ^ (A<<1) ^ A;       /* 4 */
            P = B ^ (P<<s);                /* 2 */
        }
        return P;
    }

Fig. 12  C program for the fast h-bit CRC (s < h − 1)
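One way to gain confidence in the shift-and-XOR form of Fig. 12 is to compare it with plain bitwise division by X^16 + X^2 + X + 1 on the same input. The sketch below does this for h = 16 and s = 8; the function names and the test message are hypothetical, and an int of at least 32 bits is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Reference: bitwise division by X^16 + X^2 + X + 1 (low 16 bits 0x0007),
 * one message byte at a time, most significant bit first. */
unsigned short ref_crc16_f(size_t n, const unsigned char *Q)
{
    unsigned short P = 0;
    size_t i;
    int j;
    for (i = 0; i < n; i++) {
        P = (unsigned short)(P ^ (Q[i] << 8));
        for (j = 0; j < 8; j++) {
            if ((P & 0x8000u) != 0) P = (unsigned short)((P << 1) ^ 0x0007u);
            else P = (unsigned short)(P << 1);
        }
    }
    return P;
}

/* Shift-and-XOR form following Fig. 12 (h = 16, s = 8). */
unsigned short fast_crc16_s8(size_t n, const unsigned char *Q)
{
    unsigned short A, B, P = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        A = (unsigned short)((P >> 8) ^ Q[i]);
        B = (unsigned short)((A << 2) ^ (A << 1) ^ A);  /* A(X)(X^2 + X + 1) */
        P = (unsigned short)(B ^ (P << 8));
    }
    return P;
}

/* Returns 1 if both routines agree on every prefix of a small test message. */
int fast_matches_reference(void)
{
    static const unsigned char msg[] = { 0x01, 0xFF, 0x80, 0x00, 0x5A, 0xC3, 0x7E };
    size_t len;
    for (len = 0; len <= sizeof msg; len++)
        if (ref_crc16_f(len, msg) != fast_crc16_s8(len, msg)) return 0;
    return 1;
}
```

No reduction step is needed in fast_crc16_s8 because A has at most 8 bits, so A(X)(X^2 + X + 1) has degree at most 9, which is below h = 16.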


                 x                           y

Algo. 1          4                           0 if h = 8, 16, 32, 64
(s < h)                                      1 if h ≠ 8, 16, 32, 64

Algo. 2          3 if s = h                  0
(s ≥ h)          4 if s > h                  0

Algo. 3          6 if h = 8, 16, 32, 64      0 if h = 8, 16, 32, 64
(s < h)          7 if h ≠ 8, 16, 32, 64      1 if h ≠ 8, 16, 32, 64

Algo. 4          3                           0 if s = h
(s ≥ h)                                      1 if s > h

Fig. 13  Values of x and y


A.1  CRC Software Implementation: Bitwise Technique (Without Table Lookup)

According to Fig. 10, the bitwise implementation of the basic CRCs uses Algorithm 1 for s < h, and
Algorithm 4 for s ≥ h. From Fig. 13, we then have

\[ x_b = \begin{cases} 4 & \text{if } s < h \\ 3 & \text{if } s \ge h \end{cases} \]

Substituting x_b into (33), we have

\[ e_b = \begin{cases} 8(4 + r_b)/s & \text{if } s < h \\ 8(3 + r_b)/s & \text{if } s \ge h \end{cases} \tag{34} \]

where r_b denotes the operation count required for computing B(X) of the basic CRCs.

According to Fig. 10, the bitwise implementation of the fast CRCs uses Algorithm 3 for s < h, and
Algorithm 4 for s ≥ h. From Fig. 13, we then have

\[ x_f = \begin{cases} 6 & \text{if } s < h \text{ and } h = 8, 16, 32, 64 \\ 7 & \text{if } s < h \text{ and } h \ne 8, 16, 32, 64 \\ 3 & \text{if } s \ge h \end{cases} \]

Substituting x_f into (33), we have

\[ e_f = \begin{cases} 8(6 + r_f)/s & \text{if } s < h \text{ and } h = 8, 16, 32, 64 \\ 8(7 + r_f)/s & \text{if } s < h \text{ and } h \ne 8, 16, 32, 64 \\ 8(3 + r_f)/s & \text{if } s \ge h \end{cases} \tag{35} \]

where r_f denotes the operation count required for computing B(X) of the fast CRCs. Both r_b and r_f are
computed in the following subsections.
A.1.1  Basic CRCs

Recall that B(X) for the basic CRCs is given in Definition 1. First, consider the case s < h, and let us revisit
Fig. 11. This figure contains the loop (indexed by j) for computing B(X), which is based on Remark 3. The
figure shows that the operation count required for computing B(X) is r_b = 5.5s. Next, for the case s ≥ h,
it can also be shown that r_b = 5.5s (see Fig. 7). To summarize, we have

\[ r_b = 5.5s \tag{36} \]

Substituting (36) into (34), we have

\[ e_b = \begin{cases} 8(4 + 5.5s)/s & \text{if } s < h \\ 8(3 + 5.5s)/s & \text{if } s \ge h \end{cases} \tag{37} \]

Note that (36) is derived from C programs that do not use loop unrolling (which is also the case for the
C programs presented in [2]). If loop unrolling is used, (36) reduces to r_b = 3.5s.

Here, our software implementations of the basic CRCs are general, i.e., they are applicable to all generator
polynomials M(X) and to a wide range of processor architectures. For some specific generator polynomials
that have some desirable properties, alternative implementations (such as shift and add [4], and on the fly
[14, 15]) may have lower complexity. Thus, we concentrate on the general nature of the algorithms rather
than attempting to deal with specific types of generator polynomials. Also, for our C programs, we are more
concerned with their readability and less concerned with optimization techniques such as loop unrolling and
use of register variables (see Remark 4).
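To illustrate the loop-unrolling remark, the following sketch unrolls the inner loop of the basic CRC-16 for h = 16 and s = 8. It is not taken from the paper: the macro and function names are hypothetical, and an int of at least 32 bits is assumed. Removing the loop control (2 operations per iteration) leaves the 3.5 operations of each division step, which is consistent with r_b = 3.5s.

```c
#include <assert.h>

#define CRC16_M 0x8005u   /* X^16 + X^15 + X^2 + 1 */
#define CRC16_K 0x8000u   /* 2^(h-1), h = 16 */

/* One division step: 3.5 operations on average under rules (R1) and (R2). */
#define CRC16_STEP(P)                                               \
    do {                                                            \
        if (((P) & CRC16_K) != 0)                                   \
            (P) = (unsigned short)(((P) << 1) ^ CRC16_M);           \
        else                                                        \
            (P) = (unsigned short)((P) << 1);                       \
    } while (0)

/* Basic CRC-16 (h = 16, s = 8) with the inner loop over j unrolled. */
unsigned short basic_CRC_unrolled(int n, const unsigned char *Q)
{
    unsigned short P = 0;
    int i;
    for (i = 0; i < n; i = i + 1) {
        P = (unsigned short)(P ^ (Q[i] << 8));
        CRC16_STEP(P); CRC16_STEP(P); CRC16_STEP(P); CRC16_STEP(P);
        CRC16_STEP(P); CRC16_STEP(P); CRC16_STEP(P); CRC16_STEP(P);
    }
    return P;
}

/* Convenience wrapper: CRC of a single byte. */
unsigned short basic_CRC_unrolled_byte(unsigned char b)
{
    return basic_CRC_unrolled(1, &b);
}
```

As a quick check, the single byte 0x01 yields 0x8005 (the low 16 bits of M(X)), since only the last of the 8 steps performs the XOR.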


A.1.2  Fast CRCs

Recall that B(X) of the fast CRCs is given in Theorem 1. First, assume that s < h − 1. The C program for
this case is shown in Fig. 12, which contains the procedure for computing B(X). Applying rules (R1) and
(R2) to Fig. 12, we observe that the operation count required for computing B(X) is r_f = 4. Next, assume
that s = h. The C program for this case is shown in Fig. 8, which yields r_f = 9. The C programs for all the
other cases can also be written, and the resulting software complexity can also be determined. Following is
the list of the operation counts for all the cases:

\[ r_f = \begin{cases} 4 & \text{if } s < h - 1 \\ 6.5 & \text{if } s = h - 1 \\ 9 & \text{if } s = h \\ 9 + 5.5(s - h) & \text{if } s > h \end{cases} \tag{38} \]


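Substituting (38) into (35), we have

\[ e_f = \begin{cases} 80/s & \text{if } s < h - 1 \text{ and } h = 8, 16, 32, 64 \\ 88/s & \text{if } s < h - 1 \text{ and } h \ne 8, 16, 32, 64 \\ 100/s & \text{if } s = h - 1 \text{ and } h = 8, 16, 32, 64 \\ 108/s & \text{if } s = h - 1 \text{ and } h \ne 8, 16, 32, 64 \\ 96/s & \text{if } s = h \\ 8[12 + 5.5(s - h)]/s & \text{if } s > h \end{cases} \tag{39} \]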


The operation count per input byte e_f for the fast h-bit CRC given in (39) is a function of s, which is the
size of each input tuple Q_i(X). We now determine the value of s that minimizes e_f. These optimal values
are denoted by s* and e_f*.

First, assume that h ∈ {8, 16, 32, 64}. For each h ∈ {8, 16, 32, 64}, we can search for an s ∈ {1, 2, ..., 64}
such that e_f in (39) is minimized. Our search shows that

\[ e_f^* = \begin{cases} 80/(h - 2) & \text{if } h = 16, 32, 64 \\ 96/h & \text{if } h = 8 \end{cases} \tag{40} \]

which is achieved when

\[ s^* = \begin{cases} h - 2 & \text{if } h = 16, 32, 64 \\ h & \text{if } h = 8 \end{cases} \tag{41} \]


Next, assume that h ∉ {8, 16, 32, 64}. For each h ∉ {8, 16, 32, 64}, 4 ≤ h ≤ 64, we can search for an
s ∈ {1, 2, ..., 64} such that e_f in (39) is minimized. Our search shows that

\[ e_f^* = \begin{cases} 88/(h - 2) & \text{if } h \ge 24 \\ 96/h & \text{if } 4 \le h < 24 \end{cases} \tag{42} \]

which is achieved when

\[ s^* = \begin{cases} h - 2 & \text{if } h \ge 24 \\ h & \text{if } 4 \le h < 24 \end{cases} \tag{43} \]


Thus, (41) and (43) show that the complexity of the fast h-bit CRCs is minimized (i.e., e_f is minimized)
at either s = h or s = h − 2, where s is the size of each input tuple Q_i(X).

For example, by letting h = 16, the optimal size for each input tuple Q_i(X) is s* = h − 2 = 14 [by (41)],
and the corresponding minimum operation count is e_f* = 80/(h − 2) = 80/14 = 5.71 [by (40)]. Information
on computers is typically organized in bytes or words. Thus, it is of interest to determine the optimal value
of e_f when s is restricted to a multiple of byte size and word size, i.e., when s is a multiple of 8, 16, 32, 64.
These optimal values, which are obtained from (39), are shown in Fig. 14.

Recall that e_f = e_f(s, h), i.e., e_f is a function of s and h. In Fig. 14, for a given h, s^(opt) denotes the
value of s, 1 ≤ s ≤ 64, that minimizes e_f(s, h), and the corresponding minimum e_f(s, h) is denoted by e_f^(opt).
Thus, we have e_f^(opt) = e_f(s^(opt), h) ≤ e_f(s, h) for all 1 ≤ s ≤ 64. Similarly, s^(byte) denotes the value of
s ∈ {8, 16, 24, 32, 40, 48, 56, 64} that minimizes e_f(s, h), and the corresponding minimum e_f(s, h) is denoted by
e_f^(byte). Finally, s^(word) denotes the value of s ∈ {8, 16, 32, 64} that minimizes e_f(s, h), and the corresponding
minimum e_f(s, h) is denoted by e_f^(word). For example, by letting h = 64, we have s^(opt) = 62, e_f^(opt) = 1.29,
s^(byte) = 56, e_f^(byte) = 1.43, s^(word) = 64, e_f^(word) = 1.50. In general, we must have e_f^(opt) ≤ e_f^(byte) ≤ e_f^(word).


 h    s^(opt)   s^(byte)   s^(word)   e_f^(opt)   e_f^(byte)   e_f^(word)

 4       4         8          8         24.0        34.0         34.0
 6       6         8          8         16.0        23.0         23.0
 8       8         8          8         12.0        12.0         12.0
10      10         8          8          9.60       11.0         11.0
12      12         8          8          8.00       11.0         11.0
16      14        16         16          5.71        6.00         6.00
20      20        16         16          4.80        5.50         5.50
24      22        24         16          4.00        4.00         5.50
32      30        32         32          2.67        3.00         3.00
40      38        40         32          2.32        2.40         2.75
48      46        48         32          1.91        2.00         2.75
56      54        56         32          1.63        1.71         2.75
64      62        56         64          1.29        1.43         1.50

Fig. 14  The optimal values of s and e_f for the h-bit fast CRCs, when s ∈ {1, 2, ..., 63, 64}, when
s ∈ {8, 16, 24, 32, 40, 48, 56, 64}, and when s ∈ {8, 16, 32, 64}
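The search that produces the s^(opt) and s^(byte) columns of Fig. 14 can be sketched directly from (38) and (35). The function names below are hypothetical; ties are resolved in favor of the smaller s, which is an assumption of this sketch.

```c
#include <assert.h>

/* 1 if h is one of the natural unsigned widths 8, 16, 32, 64. */
static int is_natural_width(int h)
{
    return h == 8 || h == 16 || h == 32 || h == 64;
}

/* Operation count per input byte for the bitwise fast CRC, following (39). */
double fast_crc_ops_per_byte(int s, int h)
{
    double x, r;
    assert(s >= 1 && h >= 4);
    if (s < h) x = is_natural_width(h) ? 6.0 : 7.0;   /* Algorithm 3 */
    else       x = 3.0;                               /* Algorithm 4 */
    if (s < h - 1)       r = 4.0;
    else if (s == h - 1) r = 6.5;
    else                 r = 9.0 + 5.5 * (s - h);     /* covers s == h and s > h */
    return 8.0 * (x + r) / s;
}

/* Smallest s in {step, 2*step, ..., 64} minimizing fast_crc_ops_per_byte(s, h).
 * step = 1 corresponds to s^(opt); step = 8 corresponds to s^(byte). */
int best_tuple_size(int h, int step)
{
    int s, best = step;
    assert(step >= 1 && step <= 64);
    for (s = step; s <= 64; s += step)
        if (fast_crc_ops_per_byte(s, h) < fast_crc_ops_per_byte(best, h))
            best = s;
    return best;
}
```

For h = 16 the search returns s = 14 with step 1 and s = 16 with step 8, and for h = 64 it returns 62 and 56, in agreement with the corresponding rows of Fig. 14.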

Remark 6. Our C programs for the CRCs, which follow directly from the pseudocodes in Figs. 1-4, are
written in a style that is intended to be simple and straightforward. For readability, we use an array (instead
of a pointer) for the input s-tuples Q_i. We also avoid using any C syntax that obscures the operation
counts. For example, the more explicit syntax if ((P&K) != 0) is used instead of the shorthand if (P&K).
Although these 2 expressions are equivalent, the former shows 2 operations more clearly. If desired, these C
programs can be rewritten in pointer and shorthand style, for example, as shown in Figs. 15 and 16, which
are equivalent to Figs. 7 and 8, respectively. □


    #define M 0x8005    /* M = X^16+X^15+X^2+1 */
    #define K 0x8000    /* K = 2^(h-1), h=s=16 */
    #define s 16

    unsigned short basic_CRC (int n, unsigned short *Q)
    {
        register int j;
        register unsigned short P, *Qi, *Qn;
        Qi = Q;
        Qn = Q + n;                          /* 1 */
        P = 0;
        while (Qi < Qn)                      /* 1 */
        {
            P ^= *Qi++;                      /* 2 */
            for (j=0; j<s; j++)              /* 2 */
            {
                if (P&K) P = (P<<1) ^ M;     /* 3.5 */
                else P = P<<1;
            }
        }
        return P;
    }

Fig. 15  C program for the basic h-bit CRC (s = h) in pointer style


    #define F 0x7       /* F = X^16+X^2+X+1 */
    #define K 0x8000    /* K = 2^(h-1), h=s=16 */
    #define s 16

    unsigned short fast_CRC (int n, unsigned short *Q)
    {
        register unsigned short A, C, P, *Qi, *Qn;
        Qi = Q;
        Qn = Q + n;                          /* 1 */
        P = 0;
        while (Qi < Qn)                      /* 1 */
        {
            A = P ^ *Qi++;                   /* 2 */
            if (A&K) C = (A<<1) ^ F;         /* 3.5 */
            else C = A<<1;
            if (C&K) P = (C<<1) ^ F;         /* 3.5 */
            else P = C<<1;
            P ^= C ^ A;                      /* 2 */
        }
        return P;
    }

Fig. 16  C program for the fast h-bit CRC (s = h) in pointer style

Remark 7. When s is small, we can compute B(X) = R_{M(X)}[A(X)X^h], where degree(A(X)) < s, using a
series of if-else statements as follows. For example, suppose that s = 2 and M(X) = F_h(X) = X^h + X^2 + X + 1.
Then A(X) ∈ {0, 1, X, X + 1}, and it can be shown that

\[ B(X) = \begin{cases} 0 & \text{if } A(X) = 0 \\ X^2 + X + 1 & \text{if } A(X) = 1 \\ X^3 + X^2 + X & \text{if } A(X) = X \\ X^3 + 1 & \text{if } A(X) = X + 1 \end{cases} \]
Note that polynomials can also be represented as integer numbers, e.g., the polynomial X^3 + X^2 + X is
equivalent to the decimal number 14. Thus, B(X) can be computed using the C program segment shown in
Fig. 17. Applying rules (R1) and (R2) to this C program segment, the operation count for computing B(X)
is 1, 2, 3, or 3 if A(X) is 0, 1, 2, or 3 (in integer representation), respectively. We now assume that the bits 0
and 1 of the input message occur equally likely. Thus, A(X) assumes one of the values 0, 1, 2, 3 with equal
probability of 1/4. Then, on the average, the operation count for computing B(X) is (1 + 2 + 3 + 3)/4 =
2.25. In general, this technique for computing B(X) can be applied to any generator polynomial M(X).

Let k denote the operation count required for computing B(X) = R_{M(X)}[A(X)X^h] using this if-else
technique, where degree(A(X)) < s. Note that k depends on s, i.e., k = k(s). As shown above, we then
have k(2) = 2.25, which is smaller than both r_b and r_f given in (36) and (38). In general, it can be shown
that k(s) = 2^(s−1) + 2^(−1) − 2^(−s) for s ≥ 1. In particular, k(1) = 1. Thus, this if-else technique is effective for
small s, such as s = 1, 2, or 3. However, in this paper, we are mainly concerned with the case s ≥ 8, which
is more commonly used in practice. For this case, k(s) is much greater than both r_b and r_f. Thus, when
s ≥ 8, the if-else technique is much more expensive than the basic and the new techniques, and it will not
be discussed further in this paper. Note also that this if-else technique is different from the table-lookup
technique (which will be discussed later). □


    if (A == 0) B = 0;                /* 2.25 */
    else
    {
        if (A == 1) B = 7;
        else
        {
            if (A == 2) B = 14;
            else B = 9;
        }
    }

Fig. 17  C program segment for computing B(X) = R_{F_h(X)}[A(X)X^h] for s = 2
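The closed form k(s) = 2^(s−1) + 2^(−1) − 2^(−s) can be checked by enumerating an if-else chain of the same shape as Fig. 17. The sketch below counts one == comparison per test reached, with the last value falling into the final else; the function names are hypothetical, and all 2^s values of A are assumed equally likely.

```c
#include <assert.h>

/* Average number of == comparisons executed by an if-else chain over the
 * 2^s values of A, assuming each value is equally likely. */
double if_else_chain_avg_ops(int s)
{
    unsigned long N, a, total = 0;
    assert(s >= 1 && s <= 15);
    N = 1ul << s;
    for (a = 0; a < N; a++)
        total += (a < N - 1) ? a + 1 : N - 1;   /* last value uses the final else */
    return (double)total / (double)N;
}

/* Closed form k(s) = 2^(s-1) + 2^(-1) - 2^(-s). */
double if_else_chain_closed_form(int s)
{
    double N;
    assert(s >= 1 && s <= 15);
    N = (double)(1ul << s);
    return N / 2.0 + 0.5 - 1.0 / N;
}
```

For s = 2 both functions return 2.25, and for s = 8 both return about 128.5, far above r_b = 5.5s = 44 and r_f = 4 for the same s.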

Remark 8. Consider an input message U(X), which is protected by an h-bit CRC. Recall that we
implement this CRC by first dividing the input message into n s-tuples Q_i(X), i.e., we have U(X) =
(Q_0(X), Q_1(X), ..., Q_{n−1}(X)). These s-tuples Q_i(X) then become the input to one of the CRC algorithms.
Fig. 9 shows that the complexity of the basic CRCs is rather insensitive to the values of s, whereas the
complexity of the fast CRCs is very sensitive to the values of s. Recall that the operation count per input
byte e_f in (39) is a function of s and h, i.e., e_f = e_f(s, h). For example, Fig. 9 shows that e_f(8, 16) = 10
and e_f(16, 16) = 6, i.e., e_f(16, 16) < e_f(8, 16).

So far, we have not addressed the cost of obtaining the tuples Q_i(X). We now address the impact of this cost
by considering the fast 16-bit CRC, i.e., h = 16. Suppose that the input message U(X) originally consists
of m bytes, m > 4, denoted by I_0(X), I_1(X), ..., I_{m−1}(X). Each I_i(X) is an 8-tuple. Thus, we need to
organize the bytes I_i(X) into the s-tuples Q_i(X). One technique is to simply set Q_i(X) = I_i(X), i.e., each
Q_i(X) is an 8-tuple. Let e be the operation count per input byte required for CRC encoding. We then have
s = 8, and hence e = e_f(8, 16) = 10.

An alternative technique is first to pair 2 adjacent input bytes to form 16-bit tuples from which the check
bits are then computed. More precisely, we now let s = 16 and define the new 16-tuples Q_i(X) by

\[ Q_0(X) = \begin{cases} (I_0(X), I_1(X)) & \text{if } m \text{ is even} \\ (0, I_0(X)) & \text{if } m \text{ is odd} \end{cases} \]

and

\[ Q_i(X) = \begin{cases} (I_{2i}(X), I_{2i+1}(X)) & \text{if } m \text{ is even} \\ (I_{2i-1}(X), I_{2i}(X)) & \text{if } m \text{ is odd} \end{cases} \]

for

\[ 1 \le i \le \begin{cases} (m - 2)/2 & \text{if } m \text{ is even} \\ (m - 1)/2 & \text{if } m \text{ is odd} \end{cases} \]

The algorithm for pairing the bytes and then computing the fast 16-bit CRC is shown in Fig. 18. Using
this algorithm, it can be shown that the operation count per input byte is e = 7.5, which is lower than
e_f(8, 16) = 10 of the non-pairing technique. Note that e = 7.5 > e_f(16, 16) = 6, because of the additional
cost for pairing the input bytes to form the new 16-bit tuples to be used for the CRC computation. □
    if (m is even)
        {Q = I_0 X^8 + I_1; i = 2;}
    else
        {Q = I_0; i = 1;}
    B = R_{F_16}[Q(X^2 + X + 1)];
    while (i < m − 1)
    {
        Q = I_i X^8; i = i + 1;
        Q = Q + I_i; i = i + 1;
        A = B + Q;
        B = R_{F_16}[A(X^2 + X + 1)];
    }
    P = B;
    return P;

Fig. 18  Algorithm for computing the fast 16-bit CRC directly from the m input bytes I_i
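A C rendering of the pairing algorithm might look as follows. This is a sketch rather than the paper's program: the function names are hypothetical, the reduction R_{F_16}[A(X)(X^2 + X + 1)] is done here by folding the two overflow bits back with X^16 = X^2 + X + 1 (mod F_16), which is one possible choice and not necessarily the method of Fig. 8, and an int of at least 32 bits is assumed. A bit-at-a-time reference is included only to check the result.

```c
#include <assert.h>
#include <stddef.h>

/* R_F[A(X)(X^2 + X + 1)] for F = X^16 + X^2 + X + 1 and a 16-bit A.
 * The product has degree at most 17; the overflow bits (degrees 16 and 17)
 * are folded back using X^16 = X^2 + X + 1 (mod F). */
static unsigned short mul_x2x1_mod_f16(unsigned short A)
{
    unsigned long T = ((unsigned long)A << 2) ^ ((unsigned long)A << 1) ^ A;
    unsigned long hi = T >> 16;                       /* at most 2 bits */
    return (unsigned short)((T & 0xFFFFul) ^ (hi << 2) ^ (hi << 1) ^ hi);
}

/* Fast 16-bit CRC computed from m bytes by pairing them into 16-bit tuples,
 * following the structure of Fig. 18. An empty message yields 0. */
unsigned short fast_crc16_paired(size_t m, const unsigned char *I)
{
    unsigned short Q, B;
    size_t i;
    if (m == 0) return 0;
    if (m % 2 == 0) { Q = (unsigned short)((I[0] << 8) | I[1]); i = 2; }
    else            { Q = I[0];                                 i = 1; }
    B = mul_x2x1_mod_f16(Q);
    while (i + 1 < m) {
        Q = (unsigned short)((I[i] << 8) | I[i + 1]);
        i += 2;
        B = mul_x2x1_mod_f16((unsigned short)(B ^ Q));
    }
    return B;
}

/* Reference: bit-at-a-time division by F, used only to check the sketch. */
unsigned short ref_crc16_f(size_t m, const unsigned char *I)
{
    unsigned short P = 0;
    size_t i;
    int j;
    for (i = 0; i < m; i++) {
        P = (unsigned short)(P ^ (I[i] << 8));
        for (j = 0; j < 8; j++) {
            if ((P & 0x8000u) != 0) P = (unsigned short)((P << 1) ^ 0x0007u);
            else P = (unsigned short)(P << 1);
        }
    }
    return P;
}

/* Returns 1 if both routines agree on every prefix (even and odd lengths). */
int paired_matches_reference(void)
{
    static const unsigned char msg[] = { 0xFF, 0xFF, 0x01, 0x80, 0x5A, 0xC3, 0x7E };
    size_t len;
    for (len = 0; len <= sizeof msg; len++)
        if (fast_crc16_paired(len, msg) != ref_crc16_f(len, msg)) return 0;
    return 1;
}
```

The odd-length case relies on the fact that prepending a zero byte, as in Q_0(X) = (0, I_0(X)), does not change the remainder when the register starts at 0.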


A.2  CRC Software Implementation: Table-Lookup Technique

Recall  that  the  complexity  of  the  fast  CRCs  is  low  even  without  using  table  lookup.  With  table  lookup,  the 
operation  count  is  reduced  at  the  cost  of  additional  memory  resource.  Although  our  focus  in  this  paper  is 
on  bitwise  algorithms,  we  now  also  present  table-lookup  algorithms  to  illustrate  tradeoffs  between  operation 
count  and  table  size.  Our  formulation  and  results  here  are  straightforward  generalizations  or  variations  of 
well-known  results,  which  are  available  in  [2,  4,  9,  14,  15,  16].  Note  that,  with  table  lookup,  speed  directly 
correlates  with  operation  count  under  ideal  conditions  (e.g.,  the  table  is  stored  in  the  fastest  cache,  and  there 
is  no  cache  miss).  Otherwise,  speed  may  not  correlate  directly  with  operation  count  (e.g.,  when  the  impact 
of  cache  miss  is  not  negligible  [4]). 

For table-lookup implementation, according to Fig. 10, we use Algorithm 3 (when s < h) and Algorithm 2
(when s ≥ h) for both the basic CRCs and the fast CRCs. From (11), we then have

\[ B(X) = R_{M(X)}\left[A(X)X^h\right] \]

where degree(A(X)) < s. In the following, B(X) is computed by table lookup. Let g_b and g_f be the total
number of table entries for the basic CRCs and the fast CRCs, respectively.


A.2.1  Basic CRCs

According to Fig. 10, Algorithms 2 and 3 are used for the basic CRCs. Substituting the values of x from
Fig. 13 for Algorithms 2 and 3 into (33), we have

\[ e_b = \begin{cases} 8(6 + r_b)/s & \text{if } s < h \text{ and } h = 8, 16, 32, 64 \\ 8(7 + r_b)/s & \text{if } s < h \text{ and } h \ne 8, 16, 32, 64 \\ 8(3 + r_b)/s & \text{if } s = h \\ 8(4 + r_b)/s & \text{if } s > h \end{cases} \tag{44} \]


where r_b is the operation count required for computing B(X) via table lookup. The required tables are
defined below. First, we write

\[ s = t_1 + t_2 + \cdots + t_m \]

for some m and t_i such that 1 ≤ m ≤ s and 1 ≤ t_i ≤ s (i = 1, 2, ..., m). Next, we decompose A(X) into m
polynomials A_1(X), A_2(X), ..., A_{m−1}(X), A_m(X) such that

\[ \begin{aligned} A(X) &= A_1(X)X^{t_2+t_3+\cdots+t_m} + A_2(X)X^{t_3+t_4+\cdots+t_m} + \cdots + A_{m-1}(X)X^{t_m} + A_m(X) \\ &= \sum_{i=1}^{m-1} A_i(X)X^{t_{i+1}+\cdots+t_m} + A_m(X) \end{aligned} \]
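with degree(A_i(X)) < t_i, i = 1, 2, ..., m. We then have

\[ \begin{aligned} B(X) &= R_{M(X)}\left[A(X)X^h\right] \\ &= R_{M(X)}\left[\sum_{i=1}^{m-1} A_i(X)X^{t_{i+1}+\cdots+t_m+h} + A_m(X)X^h\right] \\ &= \sum_{i=1}^{m-1} R_{M(X)}\left[A_i(X)X^{t_{i+1}+\cdots+t_m+h}\right] + R_{M(X)}\left[A_m(X)X^h\right] \\ &= \sum_{i=1}^{m-1} T_i[A_i] + T_m[A_m] \end{aligned} \tag{45} \]

where the tables T_i[ ] are defined by

\[ T_i[A_i] = \begin{cases} R_{M(X)}\left[A_i(X)X^{t_{i+1}+\cdots+t_m+h}\right] & \text{if } 1 \le i < m \\ R_{M(X)}\left[A_m(X)X^h\right] & \text{if } i = m \end{cases} \tag{46} \]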


where A_i denotes the t_i-tuple that is composed of the binary coefficients of A_i(X). For example, if t_i = 4
and A_i(X) = X^2 + 1, then A_i = (0101), which is equivalent to the decimal integer 5.

Thus, regardless of whether s < h or s ≥ h, the table T_i[ ] has 2^(t_i) entries, and each entry is an h-tuple. (For
example, let h = 16 and t_i = 8. The table T_i[ ] then has 2^8 entries, 16 bits each, i.e., the total memory
storage for this particular table is 2^8 x 16 bits = 512 bytes.) Finally, the total number of entries for the m
tables, denoted by g_b, is

\[ g_b = \sum_{i=1}^{m-1} 2^{t_i} + 2^{t_m} = \sum_{i=1}^{m} 2^{t_i} \tag{47} \]


To summarize, for a given polynomial A(X) of degree less than s, let m, t_1, t_2, ..., t_m be such that
s = t_1 + t_2 + ... + t_m. The term

\[ B(X) = R_{M(X)}\left[A(X)X^h\right] \]

can then be computed using the m tables defined by (46). The total number of entries for these tables is
g_b = 2^(t_1) + 2^(t_2) + ... + 2^(t_m). Further, regardless of whether s < h or s ≥ h, it can be shown that, using
the m tables, the number of operations required for computing B(X) is

\[ r_b = 3(m - 1) \tag{48} \]
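The table decomposition in (45) and (46) can be sketched for h = 16, s = 8, and m = 2 tables of equal index width. The identifiers and the choice m = 2 are illustrative assumptions; the tables are filled by bitwise division, and the lookup is then checked against direct division for all 2^s values of A.

```c
#include <assert.h>

#define CRC_S      8                      /* input tuple size s */
#define CRC_TABLES 2                      /* number of tables m (must divide s) */
#define CRC_T      (CRC_S / CRC_TABLES)   /* t_i = s/m index bits per table */
#define CRC_POLY   0x8005u                /* low 16 bits of M(X), h = 16 */

static unsigned short crc_tab[CRC_TABLES][1u << CRC_T];

/* R_M[v(X) X^16] for an nbits-wide value v, by bitwise division. */
static unsigned short rem_shift16(unsigned long v, int nbits)
{
    unsigned short P = 0;
    int k;
    for (k = nbits - 1; k >= 0; k--) {
        unsigned top = ((unsigned)(P >> 15) ^ (unsigned)(v >> k)) & 1u;
        P = (unsigned short)(P << 1);
        if (top != 0) P = (unsigned short)(P ^ CRC_POLY);
    }
    return P;
}

/* Fill T_i[a] = R_M[a(X) X^(t_{i+1}+...+t_m+h)], as in (46) with equal t_i.
 * crc_tab[i] holds T_{i+1} (0-based indexing). */
void build_tables(void)
{
    unsigned i, a;
    for (i = 0; i < CRC_TABLES; i++) {
        int tail = CRC_T * (CRC_TABLES - 1 - (int)i);   /* t_{i+1}+...+t_m */
        for (a = 0; a < (1u << CRC_T); a++)
            crc_tab[i][a] = rem_shift16((unsigned long)a << tail, CRC_T + tail);
    }
}

/* B = R_M[A(X) X^h] as the XOR of m table entries, as in (45).
 * Each table after the first costs one shift, one mask, and one XOR. */
unsigned short lookup_B(unsigned A)
{
    unsigned short B = crc_tab[CRC_TABLES - 1][A & ((1u << CRC_T) - 1u)];
    int i;
    for (i = CRC_TABLES - 2; i >= 0; i--) {
        A >>= CRC_T;
        B = (unsigned short)(B ^ crc_tab[i][A & ((1u << CRC_T) - 1u)]);
    }
    return B;
}

/* Returns 1 if the table lookup agrees with bitwise division for all A. */
int tables_match_bitwise(void)
{
    unsigned A;
    build_tables();
    for (A = 0; A < (1u << CRC_S); A++)
        if (lookup_B(A) != rem_shift16(A, CRC_S)) return 0;
    return 1;
}
```

Under this layout, each additional table contributes a shift, a mask, and an XOR, which is one way to read the count r_b = 3(m − 1) in (48).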


We now consider the special case t_1 = t_2 = ... = t_m = s/m. The m tables defined in (46) then become

\[ T_i[A_i] = R_{M(X)}\left[A_i(X)X^{h+s(m-i)/m}\right] \]

i = 1, 2, ..., m. Each of the m tables has 2^(s/m) entries. From (47), the total number of table entries is

\[ g_b = \sum_{i=1}^{m} 2^{s/m} = m2^{s/m} \tag{49} \]


Equations (48) and (49) show tradeoffs between the operation count r_b and the table size g_b. That is, to
decrease the table size, we must increase m in g_b = m2^(s/m), and this in turn will increase the operation
count r_b = 3(m − 1). Thus, smaller (larger) table size g_b will yield larger (smaller) operation count r_b. In
particular, when m = 1, we have r_b = 0 and g_b = 2^s. When m = s, we have r_b = 3(s − 1) and g_b = 2s.

Substituting r_b = 3(m − 1) into (44), we have

\[ e_b = \begin{cases} (24m + 24)/s & \text{if } s < h \text{ and } h = 8, 16, 32, 64 \\ (24m + 32)/s & \text{if } s < h \text{ and } h \ne 8, 16, 32, 64 \\ 24m/s & \text{if } s = h \\ (24m + 8)/s & \text{if } s > h \end{cases} \tag{50} \]
Note that our formulation is a straightforward generalization of [15], which contains the results for the special
cases h = 16, s ∈ {8, 16}, and m ∈ {1, s}. Our results [e.g., (49)] also resemble those of [9], which presents
in-depth studies of the case h = 32.

Note that the function f(x) = 2^x is convex. Given s = t_1 + t_2 + ... + t_m, from Jensen's inequality, it can then be
shown that m^(−1) \sum_{i=1}^{m} 2^{t_i} ≥ 2^(s/m), i.e., \sum_{i=1}^{m} 2^{t_i} ≥ m2^(s/m). This implies that, for a given m, the table size g_b
in (47) is minimized when t_1 = t_2 = ... = t_m. Thus, we focus on only this special case (t_i = s/m) in this
paper.

For example, let h = 16, s = 8, and m = 1. That is, we use a basic 16-bit CRC to protect a message
consisting of input bytes (s = 8). This CRC is implemented using one lookup table (m = 1), which has
g_b = m2^(s/m) = 2^8 entries. Using (50) with s < h, we have e_b = (24m + 24)/s = 6. That is, 6 operations are
required for computing the check tuple per input byte. These results are recorded in the first row of Fig. 19.
Note that, because each table entry has h = 16 bits, the total storage is hg_b = 16 x 2^8 bits = 512 bytes
(which is not shown in the figure). The results for other values of h, s, and m are shown in Fig. 19.

From Fig. 19, we observe the following. First, the results for the cases h = 8 and h = 16 are identical
for s = 32, 64, i.e., they differ only for s = 8, 16. This follows directly from (50). Similarly, the results for
the cases h = 16 and h = 32 are identical for s = 8, 64, i.e., they differ only for s = 16, 32. Although the
number of table entries g_b = m2^(s/m) depends only on s and m, the total storage is hg_b = hm2^(s/m), which
also depends on h.

Recall from Fig. 9 that the complexity results for bitwise implementation of the basic CRCs vary little
over a wide range of $s$ values. In contrast, as seen in Fig. 19, those for table-lookup implementation vary
greatly with $s$. These results can also be used to optimize CRC table-lookup implementation. For example,
suppose that $h = 16$. Let us compare the 2 cases: ($s = 8$, $m = 1$) and ($s = 16$, $m = 4$) in Fig. 19. In
both cases, the required operation count per input byte is $e_b = 6$. However, the first case requires one
table of $2^8$ entries ($= 16 \times 2^8$ bits $= 512$ bytes), while the second case requires 4 tables totaling only $2^6$ entries
($= 16 \times 2^6$ bits $= 128$ bytes), which is 75% less than the first case. More generally, Fig. 19 shows that, for a
given $e_b$, the total number of table entries $g_b$ is minimized when $s = h$.

                       e_b
    s     m     h = 8    h = 16   h = 32   h = 64   g_b
    ...   ...   ...      12       13.5     13.5     2^5
    32    1     1        1        0.75     1.5      2^32
    32    2     1.75     1.75     1.5      2.25     2^17
    32    4     3.25     3.25     3        3.75     2^10
    32    8     6.25     6.25     6        6.75     2^7
    32    16    12.25    12.25    12       12.75    2^6
    64    1     0.5      0.5      0.5      0.375    2^64
    64    2     0.875    0.875    0.875    0.75     2^33
    64    4     1.625    1.625    1.625    1.5      2^18
    64    8     3.125    3.125    3.125    3        2^11
    64    16    6.125    6.125    6.125    6        2^8
    64    32    12.125   12.125   12.125   12       2^7

Fig. 19 Complexity results for table-lookup technique for the basic h-bit CRCs ($e_b$ = operation count per input
byte, $g_b$ = total number of entries from $m$ tables)

Remark 9. Both $g_b$ and $e_b$ depend on $s$, $h$, and $m$, i.e., we can write $g_b = g_b(s, h, m)$ and $e_b = e_b(s, h, m)$.
Consider the 2 special cases: $m = s/2$ and $m = s$. From (49) and (50), it can be shown that $g_b(s, h, s/2) =
g_b(s, h, s) = 2s$ and $e_b(s, h, s/2) < e_b(s, h, s)$. That is, these 2 cases yield the same table size, but the case
$m = s/2$ always yields lower operation count than the case $m = s$. Thus, the case $m = s$ can be eliminated
from our discussion. □

Remark 10. So far, $B(X)$ is computed by either the bitwise technique or the table-lookup technique.
However, $B(X)$ can also be computed using both techniques as follows. Recall from (45) that $B(X)$ is the
sum of $m$ terms. Suppose that we now use tables to compute the first $m - 1$ terms, and use no tables to
compute the last term. More precisely, from (45), we have

$$
\begin{aligned}
B(X) &= \sum_{i=1}^{m-1} R_{M(X)}\left[A_i(X) X^{t_{i+1} + \cdots + t_m + h}\right] + R_{M(X)}\left[A_m(X) X^h\right] \\
     &= \sum_{i=1}^{m-1} T_i[A_i] + R_{M(X)}\left[A_m(X) X^h\right]
\end{aligned}
$$

where the $m - 1$ tables $T_i[\,]$ are defined by $T_i[A_i] = R_{M(X)}\left[A_i(X) X^{t_{i+1} + \cdots + t_m + h}\right]$, $1 \leq i < m$. Assume
now that $R_{M(X)}\left[A_m(X) X^h\right]$ is computed without using tables. Thus, $B(X)$ can be computed using the
2 techniques at the same time: the table-lookup technique (with the $m - 1$ tables $T_i[\,]$) and the bitwise
technique (for computing $R_{M(X)}\left[A_m(X) X^h\right]$ without using tables). In the following, this mixed technique
is applied to the fast CRCs to yield small table size when $s \approx h$. □


A.2.2 Fast CRCs

Recall from Section 2.1 that, when implementing an h-bit CRC, we are free to choose the value of $s$, which is
the size of each input tuple $Q_i(X)$. That is, we can choose $s < h$, $s = h$, or $s > h$. Fig. 9 shows that, under
bitwise implementation, the fast CRCs are much faster than the basic CRCs for $s \leq h$, in the sense that $e_f$
is much smaller than $e_b$. Further, by comparing Fig. 9 with Fig. 19, we see that the bitwise implementation
of the fast CRCs (i.e., $g_f = 0$) is even faster than the table-lookup implementation of the basic CRCs (i.e.,
$g_b > 0$) in many cases. For example, consider the case $s = h = 32$. Fig. 19 shows that $e_b = 6$ when $g_b = 2^7$
(at $m = 8$), and $e_b = 12$ when $g_b = 2^6$ (at $m = 16$). On the other hand, Fig. 9 shows that $e_f = 3$ when
$g_f = 0$. The same figures also show that, although the fast CRC requires no table lookup (i.e., $g_f = 0$) and
the basic CRC requires a table of $g_b = 2^{10}$ entries (at $m = 4$), both CRCs have the same operation count
$e_f = e_b = 3$.

Recall from (41) and (43) that, under bitwise implementation, $e_f$ is minimized either at $s = h - 2$ or at
$s = h$. Thus, by choosing $s$ to be at (or near) these optimal values, the fast CRCs require no table lookup
(i.e., $g_f = 0$) and still have low operation count (i.e., $e_f$ is small).

We now discuss table-lookup techniques for the fast CRCs generated by $F_h(X) = X^h + X^2 + X + 1$.
An obvious technique is to apply the table-lookup technique in Section A.2.1 for the basic CRCs to the fast
CRCs by simply letting $M(X) = F_h(X)$. The required total number of table entries is then given by (49),
i.e., $g_f = g_b = m 2^{s/m}$. In the following, for the case $s \geq h$, we present another table-lookup technique (which
is similar to the mixed technique in Remark 10 with $m = 2$) that exploits the special structure of $F_h(X)$ to
yield $g_f = 2^{s-h+2}$, which is small when $s \approx h$, e.g., $g_f = 4$ when $s = h$.

Recall that $r_f$ denotes the operation count required for computing $B(X)$. Without using tables (i.e.,
when $g_f = 0$), $r_f$ is given by (38), i.e., $r_f = 9 + 5.5(s - h)$ for $s \geq h$. We show below that $r_f$ is slightly
reduced by using a small lookup table.

Assume that $s \geq h$. According to Fig. 10, we use Algorithm 2 to implement the table-lookup technique
for the fast CRCs when $s \geq h$. From Fig. 13, we then have

$$
x = \begin{cases} 3 & \text{if } s = h \\ 4 & \text{if } s > h \end{cases}
$$

which is inserted in (33) to yield

$$
e_f = \begin{cases} 8(3 + r_f)/s & \text{if } s = h \\ 8(4 + r_f)/s & \text{if } s > h \end{cases}
\tag{51}
$$

where $r_f$, which denotes the operation count required for computing $B(X)$ via table lookup, is determined
in the following.

First, we decompose $A(X)$ into $A_1(X)$ and $A_2(X)$ such that

$$
A(X) = A_1(X) X^{h-2} + A_2(X)
$$

where $\text{degree}(A_1(X)) < s - h + 2$ and $\text{degree}(A_2(X)) < h - 2$. Using (11) with $M(X) = F_h(X)$, we have

$$
\begin{aligned}
B(X) &= R_{F_h(X)}\left[A(X) X^h\right] \\
     &= R_{F_h(X)}\left[A(X)(X^h + F_h(X))\right] \\
     &= R_{F_h(X)}\left[A(X)(X^2 + X + 1)\right] \\
     &= R_{F_h(X)}\left[A_1(X) X^{h-2}(X^2 + X + 1)\right] + A_2(X)(X^2 + X + 1) \\
     &= T_f[A_1] + A_2(X)(X^2 + X + 1)
\end{aligned}
$$

where $T_f[\,]$ is the table defined by

$$
T_f[A_1] = R_{F_h(X)}\left[A_1(X) X^{h-2}(X^2 + X + 1)\right]
\tag{52}
$$


where $A_1$ denotes the $(s - h + 2)$-tuple that is composed of the binary coefficients of the polynomial $A_1(X)$
of degree less than $s - h + 2$. The table $T_f[\,]$ has $g_f = 2^{s-h+2}$ entries, and each entry contains $h$ bits. Using
this table, it can be shown that the operation count required for computing $B(X)$ is $r_f = 7$. To summarize,
when $s \geq h$, we have

$$
r_f = 7, \qquad g_f = 2^{s-h+2}
\tag{53}
$$


Substituting (53) into (51), we then have the following operation count per input byte and the table size
for the case $s \geq h$:

$$
\begin{cases}
e_f = 80/s, \quad g_f = 4 & \text{if } s = h \\
e_f = 88/s, \quad g_f = 2^{s-h+2} & \text{if } s > h
\end{cases}
\tag{54}
$$


i.e., the table size $g_f$ grows exponentially with the difference $s - h$. Thus, this table-lookup technique is not
recommended for large $s - h$. To have a small table, we must choose $s$ that is sufficiently close to $h$. The
table size is minimized when $s = h$, which yields $g_f = 4$, i.e., the fast h-bit CRCs now require $e_f = 80/h$
operations per input byte and a small table of only $g_f = 4$ entries. This is about 17% lower than the bitwise
technique (i.e., $g_f = 0$), which requires $e_f = 96/h$ operations per input byte (equivalently, the bitwise
technique needs 20% more operations). Fig. 20 shows the numerical
values of (54) for $s, h \in \{8, 16, 32, 64\}$, which vary greatly with $h$ and $s$. In particular, the table size is large
($g_f \geq 2^{10}$) when $s > h$, but is very small ($g_f = 4$) when $s = h$.
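The entries of Fig. 20 follow mechanically from (54). The sketch below simply evaluates that formula; the function names are hypothetical, and it assumes $s \geq h$ and $s - h + 2 < 64$ so that the table size fits in a 64-bit integer.

```c
#include <stdint.h>

/* Operation count per input byte for the fast CRC with table lookup, from (54).
   Assumption: s >= h. */
double fast_table_ops_per_byte(unsigned s, unsigned h)
{
    return (s == h ? 80.0 : 88.0) / s;
}

/* Number of table entries g_f = 2^(s-h+2), from (54).
   Assumption: s >= h and s - h + 2 < 64. */
uint64_t fast_table_entries(unsigned s, unsigned h)
{
    return (uint64_t)1 << (s - h + 2);
}
```

For example, $h = 8$ and $s = 16$ give $e_f = 5.5$ with $2^{10}$ entries, whereas $s = h = 16$ gives $e_f = 5$ with only 4 entries.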

For the special case $s = h$ for which $g_f = 4$, it can be shown that the 4 entries of the table (52) are given
by

$$
\begin{aligned}
T_f[0] &= 0 \\
T_f[1] &= X^{h-1} + X^{h-2} + X^2 + X + 1 \\
T_f[2] &= X^{h-1} + X^3 + 1 \\
T_f[3] &= X^{h-2} + X^3 + X^2 + X
\end{aligned}
$$

These entries in hexadecimal are shown in Fig. 21.
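As a check on these four entries, the definition (52) can be evaluated directly by polynomial long division over GF(2). The following is a minimal sketch, assuming $3 \leq h \leq 32$ and that bit $i$ of an integer holds the coefficient of $X^i$; the helper names are hypothetical.

```c
#include <stdint.h>

/* Remainder of p(X) divided by F_h(X) = X^h + X^2 + X + 1.
   Assumption: 3 <= h <= 32; bit i of p is the coefficient of X^i. */
static uint64_t mod_fh(uint64_t p, unsigned h)
{
    const uint64_t f = ((uint64_t)1 << h) | 0x7;

    for (int bit = 63; bit >= (int)h; bit--)
        if ((p >> bit) & 1)
            p ^= f << (bit - h);   /* cancel the term X^bit */
    return p;
}

/* Entry T_f[a1] of (52) for s = h, where a1 is the 2-bit tuple A_1. */
uint64_t fast_table_entry(unsigned a1, unsigned h)
{
    uint64_t a = a1 & 0x3;
    uint64_t prod = (a << 2) ^ (a << 1) ^ a;   /* A_1(X)(X^2 + X + 1) */

    return mod_fh(prod << (h - 2), h);         /* times X^(h-2), then reduced */
}
```

With $h = 16$ this yields 0, c007, 8009, and 400e for $A_1 = 0, 1, 2, 3$, in agreement with the hexadecimal values listed for the four-entry table.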

The table-lookup algorithm for the fast CRC (when $s \geq h$) is given in Fig. 22, where the $2^{s-h+2}$ entries
of the table $T_f[\,]$ defined by (52) are stored in the top part of the algorithm. The C program for the special
case $s = h = 16$, which is based on Fig. 22, is given in Fig. 23.


    h     s     e_f     g_f
    8     8     10      2^2
    8     16    5.5     2^10
    8     32    2.75    2^26
    8     64    1.375   2^58
    16    16    5       2^2
    16    32    2.75    2^18
    16    64    1.375   2^50
    32    32    2.5     2^2
    32    64    1.375   2^34
    64    64    1.25    2^2

Fig. 20 Complexity results for the fast h-bit CRCs with table lookup, $s \geq h$ ($e_f$ = operation count per input
byte, $g_f$ = total number of table entries)

    h     T_f[0]   T_f[1]             T_f[2]             T_f[3]
    8     0        c7                 89                 4e
    16    0        c007               8009               400e
    32    0        c0000007           80000009           4000000e
    64    0        c000000000000007   8000000000000009   400000000000000e

Fig. 21 Four-entry tables for the fast h-bit CRCs ($s = h$)


    store T_f[0], ..., T_f[2^(s-h+2) - 1];
    B = 0;
    for (0 <= i < n)
    {
        A = B X^(s-h) + Q_i;
        A_1 = (s - h + 2) left-hand bits of A;
        A_2 = (h - 2) right-hand bits of A;
        B = T_f[A_1] + A_2 (X^2 + X + 1);
    }
    P = B;
    return P;

Fig. 22 Table-lookup algorithm for the fast h-bit CRC ($s \geq h$)


    unsigned short
    fast_CRC_table (int n, unsigned short *Q)
    {
        int i;
        unsigned short A, A1, A2, P;
        static unsigned short T[4] =
            {0x0, 0xc007, 0x8009, 0x400e};

        P = 0;
        for (i=0; i<n; i=i+1)                     /* 2 */
        {
            A = P ^ Q[i];                         /* 1 */
            A1 = A >> 14;                         /* 1 */
            A2 = A & 0x3fff;                      /* 1 */
            P = T[A1] ^ (A2<<2) ^ (A2<<1) ^ A2;   /* 5 */
        }
        return P;
    }

Fig. 23 C program for the fast h-bit CRC with table lookup ($s = h = 16$)
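One way to gain confidence in the four-entry table is to compare it against a plain bit-by-bit division by $F_{16}(X) = X^{16} + X^2 + X + 1$. The sketch below does this under the assumption that the check tuple is the remainder of the message polynomial times $X^{16}$, with a zero initial value and no bit reflection; both function names are hypothetical, and the table version restates the program of Fig. 23 with fixed-width types.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: bit-by-bit remainder of U(X) X^16 divided by X^16 + X^2 + X + 1,
   where U(X) is the message formed by the n input 16-tuples. */
uint16_t crc16_fast_poly_bitwise(size_t n, const uint16_t *q)
{
    uint16_t p = 0;

    for (size_t i = 0; i < n; i++) {
        p ^= q[i];
        for (int bit = 0; bit < 16; bit++)
            p = (p & 0x8000) ? (uint16_t)((p << 1) ^ 0x0007)
                             : (uint16_t)(p << 1);
    }
    return p;
}

/* Four-entry table version, following the structure of Fig. 23. */
uint16_t crc16_fast_poly_table(size_t n, const uint16_t *q)
{
    static const uint16_t t[4] = {0x0000, 0xc007, 0x8009, 0x400e};
    uint16_t p = 0;

    for (size_t i = 0; i < n; i++) {
        uint16_t a  = (uint16_t)(p ^ q[i]);
        uint16_t a1 = (uint16_t)(a >> 14);     /* 2 left-hand bits   */
        uint16_t a2 = (uint16_t)(a & 0x3fff);  /* 14 right-hand bits */

        p = (uint16_t)(t[a1] ^ (a2 << 2) ^ (a2 << 1) ^ a2);
    }
    return p;
}
```

Both functions should return the same value for any input; for instance, the single word 0x0001 gives 0x0007 and the single word 0x4000 gives 0xc007 (the table entry for $A_1 = 1$).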

APPENDIX  B  OTHER  FAST  ERROR-DETECTION  CODES 

So far, we apply the new technique (15) to $F_h(X) = X^h + X^2 + X + 1$ to yield the fast h-bit CRCs. We
now use this same technique to construct some other error-detection codes, which are also fast and have
minimum distances $d = 2$, 3, or 4.

Recall from Section 3.3 that the maximum length of an error-detection code is defined to be the total bit
length at or below which its minimum distance is $d \geq 3$, i.e., beyond which its minimum distance will reduce
to $d = 2$. Theorem 2 shows that the maximum length of a CRC is the period of its generator polynomial.
In the following, $n_b$ denotes the total bit length of a code.


B.1 Fast CRCs Generated by Binomials


Consider the h-bit CRC generated by the binomial $M(X) = X^h + 1$, which has period $h$. To avoid triviality,
we assume that this CRC includes at least one input bit, i.e., $n_b > h$. From Theorem 2, this CRC then has
the minimum distance $d = 2$, i.e., it is a weak code for error detection. This CRC can be implemented via
Fig. 3 (for $s \leq h$) or Fig. 4 (for $s > h$). Applying the new technique (15) to $M(X) = X^h + 1$, the term $B(X)$
in these figures is given by

$$
B(X) = \begin{cases}
A(X) & \text{if } s \leq h \\
R_{N(X)}\left[A(X) X^{s-h}\right] & \text{if } s > h
\end{cases}
$$

where $N(X) = (X^h + 1) X^{s-h}$. Note that by choosing $s \leq h$, we have $B(X) = A(X)$, i.e., the polynomial
division is eliminated.

Suppose now that $s = h$. The CRC generated by $X^h + 1$ can then be implemented by Fig. 6 with
$B(X) = A(X)$. Fig. 6 can be further simplified to yield the following pseudocode for computing the check
h-tuple $P(X)$:

    P = 0;
    for (0 <= i < n)
        P = P + Q_i;
    return P;

which yields

$$
P(X) = \sum_{i=0}^{n-1} Q_i(X)
$$

i.e., the CRC generated by $M(X) = X^h + 1$ is identical to the block-parity checksum [4]. From the above
pseudocode, it can be shown that computing the check tuple $P(X)$ for this checksum requires $e = 24/s$
operations per input byte. Recall from Section 4 that $e_b = 8(3 + 5.5s)/s$ and $e_f = 96/s$. We then have
$e_f/e = 96/24 = 4$ and $e_b/e = 8(3 + 5.5s)/24 = (3 + 5.5s)/3$.

For example, if $s = h = 16$, then computing the check tuple $P(X)$ for the 16-bit block-parity checksum
requires $e = 24/16 = 1.5$ operations per input byte. We then have $e_f/e = 4$ and $e_b/e = (3 + 5.5 \times 16)/3 =
30.33$. Thus, as expected, the block-parity checksum (which has minimum distance $d = 2$) is substantially
faster than the fast and basic CRCs (both of which have minimum distance $d = 4$).
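In C, the pseudocode above for $s = h = 16$ reduces to a single exclusive-OR per input word, because addition of binary polynomials is bitwise XOR. A minimal sketch (the function name is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* 16-bit block-parity checksum: P(X) is the sum (XOR) of all input 16-tuples,
   i.e., the CRC generated by X^16 + 1 with s = h = 16. */
uint16_t block_parity16(size_t n, const uint16_t *q)
{
    uint16_t p = 0;

    for (size_t i = 0; i < n; i++)
        p ^= q[i];
    return p;
}
```

Because every bit position is summed independently, two errors in the same bit position of different words cancel, which is consistent with the minimum distance $d = 2$ noted above.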


B.2 Fast CRCs Generated by Trinomials


Let $C$ be the CRC generated by the trinomial $T_h(X) = X^h + X + 1$. The periods $t$ of the trinomials are
given in Fig. 24 for $h \geq 3$. Note that the periods $t$ for the important cases $h = 8$, 16, 32, 64 are unusually
small. Because $T_h(X)$ is a codeword of weight 3, the minimum distance $d$ of this CRC must satisfy $d \leq 3$.
From Theorem 2, we then have $d = 3$ if $n_b \leq t$, and $d = 2$ if $n_b > t$. This CRC can be implemented via
Fig. 3 (for $s \leq h$) or Fig. 4 (for $s > h$). Applying the new technique (15) to $M(X) = T_h(X)$, the term $B(X)$
in these figures is given by

$$
B(X) = \begin{cases}
R_{T_h(X)}\left[A(X)(X + 1)\right] & \text{if } s \leq h \\
R_{N(X)}\left[A(X) X^{s-h}(X + 1)\right] & \text{if } s > h
\end{cases}
\tag{55}
$$

where $N(X) = (X^h + X + 1) X^{s-h}$. Remark 1 implies that it is simpler to compute the $B(X)$ in (55) than
the $B(X)$ in (16). Thus, the CRC generated by the trinomial $T_h(X) = X^h + X + 1$ is faster than the fast
CRC generated by $F_h(X) = X^h + X^2 + X + 1$. However, the former has minimum distance only $d = 3$,
whereas the latter has minimum distance $d = 4$. Further, for the important cases of $h = 8, 16, 32, 64$, the
maximum length of the faster CRC generated by the trinomial $T_h(X)$ is much shorter than that of the
fast CRC generated by $F_h(X)$. For example, the faster 16-bit CRC generated by $T_{16}(X)$ has $d = 3$ and
the maximum length of only 255 bits (see Fig. 24), whereas the fast CRC generated by $F_{16}(X)$ has $d = 4$
and the maximum length of $2^{15} - 1 = 32767$ bits (see Section 3.2). Thus, these 2 types of CRCs illustrate
tradeoffs between code capability and complexity.

    h      period        (2^h - 1)/period
    3      7             1
    4      15            1
    7      127           1
    8      63            4.05
    15     32767         1
    16     255           257
    23     2088705       4.02
    24     2097151       8
    31     2097151       1024
    32     1023          4.2 x 10^6
    63     2^63 - 1      1
    64     4095          4.5 x 10^15
    127    2^127 - 1     1
    128    16383         2.1 x 10^34

Fig. 24 The period of trinomial $T_h(X) = X^h + X + 1$
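For small $h$, the periods in Fig. 24 can be reproduced by brute force: the period is the least $t > 0$ such that $X^t$ leaves remainder 1 when divided by $T_h(X)$. The sketch below is one simple way to do this, assuming $2 \leq h \leq 62$; the function name is hypothetical, and the loop takes as many iterations as the period, so it is only practical when the period is modest (it is not intended for cases such as $h = 63$).

```c
#include <stdint.h>

/* Period of T_h(X) = X^h + X + 1: the least t > 0 with X^t = 1 (mod T_h(X)).
   Assumption: 2 <= h <= 62. Bit i of state is the coefficient of X^i. */
uint64_t trinomial_period(unsigned h)
{
    const uint64_t top = (uint64_t)1 << h;
    uint64_t state = 1;            /* X^0 */
    uint64_t t = 0;

    do {
        state <<= 1;               /* multiply by X */
        if (state & top)
            state ^= top | 0x3;    /* reduce by X^h + X + 1 */
        t++;
    } while (state != 1);
    return t;
}
```

The loop always terminates because $T_h(X)$ has a nonzero constant term, so multiplication by $X$ is invertible modulo $T_h(X)$ and the sequence of states must return to 1.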

B.3  Fast  and  Optimal  Error-Detection  Codes 

In the following, we construct codes that are not only fast, but also have optimal error-detecting capability.
The h-bit CRC in Section B.2, which is denoted by $C$ and has minimum distance $d = 3$, can be extended
to yield a code that has $d = 4$ by adding an overall parity bit to the h-bit CRC. Note that this extended
code, denoted by $C^*$, has $h^* = h + 1$ check bits and is not a CRC. The h-bit CRC has burst-error-detecting
capability $b = h$. The following theorem shows that the extended code $C^*$ has burst-error-detecting capability
$b = h^* = h + 1$.

Theorem 4. Let $C$ be an h-bit CRC generated by a polynomial $M(X)$ of degree $h$. Assume that $X$ is not
a factor of $M(X)$, i.e., $\gcd(X, M(X)) = 1$, and that $M(X)$ has odd weight, i.e., it has an odd number of
terms. Let $C^*$ be the non-CRC code that is obtained by adding an overall parity check bit to $C$, i.e., $C^*$ has
$h^* = h + 1$ check bits. Then $C^*$ detects all error bursts of length $h + 1$ or less, i.e., its burst-error-detecting
capability is $b = h + 1$.

Proof. Let $V^*(X)$ be a codeword of $C^*$. By definition of $C^*$, we have $V^*(X) = V(X)X + \text{parity}(V(X))$,
where $V(X)$ is a codeword of $C$. Because $V(X)$ is a codeword of the CRC generated by $M(X)$, we have
$V(X) = K(X)M(X)$ for some polynomial $K(X)$ (see the proof of Theorem 2). We then have

$$
V^*(X) = K(X)M(X)X + \text{parity}(K(X)M(X))
$$

Let $E^*(X)$ be an error burst of length $h + 1$ or less, which has the form

$$
E^*(X) = X^i (E(X) + 1)
$$

where $i \geq 0$, and $E(X)$ is a polynomial such that $E(X) \neq 1$ and $\text{degree}(E(X)) \leq h$. Using proof by
contradiction, we now show that $E^*(X)$ cannot be a codeword of $C^*$. Thus, assume that $E^*(X)$ is a nonzero
codeword of $C^*$, i.e.,

$$
X^i (E(X) + 1) = K(X)M(X)X + \text{parity}(K(X)M(X))
$$

We consider 2 cases: $i = 0$ and $i > 0$.

Case 1: $i = 0$. We then have

$$
E(X) + 1 = K(X)M(X)X + \text{parity}(K(X)M(X))
$$

This implies that $\text{parity}(K(X)M(X)) = 1$. Thus, $K(X) \neq 0$, which implies that $\text{degree}(K(X)M(X)X) >
h$. But we also have $E(X) = K(X)M(X)X$, which implies that $\text{degree}(K(X)M(X)X) \leq h$, which is a
contradiction to the previous statement.

Case 2: $i > 0$. We then have $\text{parity}(K(X)M(X)) = 0$. Thus, $X^i (E(X) + 1) = K(X)M(X)X$. Because
$\gcd(X, M(X)) = 1$ and $\text{degree}(E(X) + 1) \leq h = \text{degree}(M(X))$, we must have $X^i = K(X)X$. Thus, $K(X) = X^{i-1}$. We then have $\text{parity}(K(X)M(X)) =
\text{parity}(M(X)) = 1$, which is a contradiction to the previous statement that $\text{parity}(K(X)M(X)) = 0$. □

Let $t$ be the period of the polynomial $M(X)$ in Theorem 4. The extended code $C^*$ in Theorem 4 then
has $h^* = h + 1$ check bits, the burst-error-detecting capability $b = h^* = h + 1$, the minimum distance
$d = 4$, and the maximum length of $t + 1$ bits. In the following, we show that $C^*$ becomes fast by choosing
$M(X) = T_h(X) = X^h + X + 1$, i.e., $M(X)$ is a trinomial.

Thus, let $M(X) = X^h + X + 1$, and $P_{crc}(X)$ be the check h-tuple for the CRC generated by this
particular $M(X)$. Suppose that $s = h + 1$. Because $s > h$, the check tuple $P_{crc}(X)$ can be computed by
Algorithm 4 (see Fig. 4), in which the term $B(X)$ is computed by (55), i.e.,

$$
\begin{aligned}
B(X) &= R_{N(X)}\left[A(X) X (X + 1)\right] \\
     &= R_{N(X)}\left[A(X)(X^2 + X)\right]
\end{aligned}
$$

where $N(X) = (X^h + X + 1)X$, and $\text{degree}(A(X)) < s = h + 1$.

Recall from Theorem 4 that the non-CRC code $C^*$ is obtained by adding an overall parity check bit to
the above CRC. The overall parity bit of $C^*$ is computed as follows. First, we define

$$
W(X) = \sum_{i=0}^{n-1} Q_i(X) + P_{crc}(X) X
$$

where $Q_0(X), \ldots, Q_{n-1}(X)$ are the input s-tuples. The overall parity bit of $C^*$ is also the parity bit of
$W(X)$. The check polynomial of $C^*$ is then

$$
P(X) = P_{crc}(X) X + \text{parity}(W(X))
$$

which is a polynomial of degree $< h + 1$.

Fig. 25 shows an implementation of $C^*$, which is based on Fig. 4 (with $s = h^* = h + 1$ and $M(X) =
X^h + X + 1$) and includes the calculation of the overall parity bit of $C^*$. Let $e^*$ be the operation count
per input byte required for computing the check tuple $P(X)$ for the code $C^*$. By ignoring the negligible
complexity due to the terms outside the loop indexed by $i$ in Fig. 25, it can be shown that $e^* = 96/h^*$. It is
shown in (39) of Appendix A that the complexity for the fast $h^*$-bit CRC is also given by $e_f = 96/h^*$ (when
$s = h^*$). Thus, $e^* = e_f$, i.e., the (non-CRC) $h^*$-bit code $C^*$ is as fast as the fast $h^*$-bit CRC.

Let $M(X)$ be a primitive polynomial of degree $h$, i.e., the period of $M(X)$ is $t = 2^h - 1$. Let us now
compare the capability and complexity for the following 2 codes, each of which has $h^* = h + 1$ check
bits. The first code is the familiar basic CRC generated by $(X + 1)M(X)$, which has $d = 4$, $b = h + 1$,
and the maximum length of $2^h - 1$ bits. An example is the well-known CRC-16, which is generated by
$M(X) = (X + 1)(X^{15} + X + 1) = X^{16} + X^{15} + X^2 + 1$. Under bitwise implementation, this basic CRC
requires $e_b = 45.5$ operations per input byte for computing its check tuple, provided that the input message
is composed of 16-tuples, i.e., $s = h^* = 16$ (see Fig. 9).

The second code is the non-CRC code $C^*$ as described in Theorem 4, which has $d = 4$, $b = h + 1$, and
the maximum length of $t + 1 = 2^h$ bits (which is 1 bit longer than the basic $(h + 1)$-bit CRC above). It
is well-known that any code that has $h + 1$ check bits and the minimum distance $d = 4$ must satisfy the
following 2 constraints: (1) the burst-error-detecting capability $b \leq h + 1$ and (2) the maximum length
$\leq 2^h$. Thus, the non-CRC code $C^*$ is optimal for error detection in the sense that, with $h^* = h + 1$ check
bits and $d = 4$, it has the optimal $b = h^*$ and the optimal maximum length $2^h$. In fact, at the maximum
length of $2^h$ bits, the code $C^*$ is a $(2^h, 2^h - h - 1, 4)$ extended Hamming perfect code with the optimal
burst-error-detecting capability $b = h + 1$. Also, it is well-known that the undetected error probability of
this perfect code is bounded above by $2^{-(h+1)}$.

As shown in Fig. 25, the code $C^*$ is fast when $M(X) = T_h(X) = X^h + X + 1$, i.e., $M(X)$ is a trinomial.
It is known that $T_h(X)$ is primitive for some values of $h$ [18], including $h = 3$, 7, 15, 63, and 127 (i.e.,
$h + 1 = 4$, 8, 16, 64, and 128). For example, let $h = 15$ and $s = h^* = h + 1 = 16$. Using Fig. 25, it can
be shown that the operation count per input byte required for computing the check tuple for the (non-CRC)
16-bit code $C^*$ is $e^* = 96/16 = 6$ (which is much smaller than $e_b = 45.5$ of the basic CRC-16 above). To
summarize, the (non-CRC) $h^*$-bit code $C^*$ (e.g., with $h^* = 4$, 8, 16, 64, or 128 check bits) constructed from
a primitive trinomial and an overall parity check bit has (a) the optimal error-detection capability and (b) a
fast bitwise implementation. Note that, as discussed later in Section C.3, other fast and optimal codes can
also be constructed from polynomials different from trinomials.

    W = 0;
    B = 0;
    for (0 <= i < n)
    {
        A = B + Q_i;
        B = R_N [A (X^2 + X)];
        W = W + Q_i;
    }
    W = W + B;
    P = B + parity(W);
    return P;

Fig. 25 Algorithm for computing the fast $(h + 1)$-bit non-CRC code from the h-bit CRC (generated by
$X^h + X + 1$) and an overall parity bit
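The algorithm of Fig. 25 can be sketched in C for the case $h = 15$, $s = h^* = 16$. This is an illustrative sketch rather than the authors' program: the function names are hypothetical, the reduction modulo $N(X) = X^{16} + X^2 + X$ is written out as two conditional XORs, and a bit-serial division by $X^{15} + X + 1$ is included only as a reference for cross-checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Check tuple of the 16-bit code C* built from X^15 + X + 1 (h = 15, s = 16),
   following the structure of Fig. 25. */
uint16_t cstar16(size_t n, const uint16_t *q)
{
    const uint32_t nx = 0x10006;                /* N(X) = X^16 + X^2 + X */
    uint16_t b = 0, w = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t a = (uint32_t)(b ^ q[i]);      /* A = B + Q_i */
        uint32_t t = (a << 2) ^ (a << 1);       /* A(X)(X^2 + X), degree <= 17 */

        if (t & 0x20000) t ^= nx << 1;          /* cancel X^17 */
        if (t & 0x10000) t ^= nx;               /* cancel X^16 */
        b = (uint16_t)t;                        /* B = R_N[A(X^2 + X)] */
        w ^= q[i];                              /* W = W + Q_i */
    }
    w ^= b;                                     /* W = W + B */
    w ^= w >> 8; w ^= w >> 4; w ^= w >> 2; w ^= w >> 1;   /* fold to parity */
    return (uint16_t)(b ^ (w & 1));             /* P = B + parity(W) */
}

/* Reference only: bit-serial remainder of U(X) X^15 divided by X^15 + X + 1. */
uint16_t crc15_bitwise(size_t n, const uint16_t *q)
{
    uint16_t r = 0;

    for (size_t i = 0; i < n; i++)
        for (int bit = 15; bit >= 0; bit--) {
            unsigned fb = ((r >> 14) ^ (q[i] >> bit)) & 1;

            r = (uint16_t)((r << 1) & 0x7fff);
            if (fb)
                r ^= 0x0003;                    /* X + 1 */
        }
    return r;
}
```

The upper 15 bits of the value returned by `cstar16` should equal the 15-bit CRC remainder, and the lowest bit is chosen so that the message words together with the check tuple have even overall parity.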


APPENDIX  C  APPLICATION  OF  THE  NEW  TECHNIQUE  TO  GENERAL  GENERATOR 
POLYNOMIALS 

So far, we apply the new technique (15) to the polynomials $X^h + X^2 + X + 1$, $X^h + X + 1$, and $X^h + 1$ to yield
fast CRCs. In this appendix, we apply the same technique to more general generator polynomials, and then
determine the conditions under which the new technique is faster than the basic technique. In particular, we
show later in Section C.1.2.1 that, when applied to the CRC-64-ISO generated by $X^{64} + X^4 + X^3 + X + 1$, the
new technique is 15 times faster than the basic technique. This appendix presents only bitwise algorithms.
Consider an h-bit CRC that is generated by a general polynomial

$$
\begin{aligned}
F(X) &= X^h + X^{i_k} + X^{i_{k-1}} + \cdots + X^{i_1} + 1 \\
     &= X^h + H(X)
\end{aligned}
\tag{56}
$$

where $k > 0$, $h > i_k > i_{k-1} > \cdots > i_1 > 0$, and

$$
H(X) = X^{i_k} + X^{i_{k-1}} + \cdots + X^{i_1} + 1
\tag{57}
$$

Note that $i_k \geq k > 0$, $i_k = \text{degree}(H(X))$, and $k$ = weight of $(H(X) + 1)$. Here, we have $F(X) \neq X^h + 1$
because $i_1 > 0$, i.e., $H(X) \neq 1$. The case $F(X) = X^h + 1$ is already discussed in Section B.1, where it is
shown that the CRC reduces to the block-parity checksum.

The h-bit CRC generated by (56) can be computed either by the basic technique (see Definition 1) or by
the new technique (15). Recall that CRC complexity refers to the operation count per input byte (denoted
by $e_b$ and $e_f$ for the basic and the fast CRCs, respectively) required for computing the CRC check tuple.
Again, we assume that the CRCs are implemented in C, and the operations are counted according to rules
(R.1) and (R.2) stated in Appendix A.


C.1 General CRC Generator Polynomials

First, suppose that the basic technique is used to compute the check tuple of the CRC generated by (56),
i.e., $B(X)$ is computed as in Definition 1 with $M(X) = F(X)$ in (11). From (37), we have

$$
e_b = \begin{cases}
8(4 + 5.5s)/s & \text{if } s < h \\
8(3 + 5.5s)/s & \text{if } s \geq h
\end{cases}
\tag{58}
$$


Next, suppose that the new technique is used to compute the check tuple of the CRC generated by (56).
By letting $M(X) = F(X)$ in (15), we have

$$
B(X) = \begin{cases}
R_{F(X)}\left[A(X)(X^h + F(X))\right] & \text{if } s \leq h \\
R_{N(X)}\left[A(X)(X^s + N(X))\right] & \text{if } s > h
\end{cases}
\tag{59}
$$

where $N(X) = F(X) X^{s-h}$ and $\text{degree}(A(X)) < s$. Substituting (56) into (59), we have

$$
B(X) = \begin{cases}
R_{F(X)}\left[A(X) H(X)\right] & \text{if } s \leq h \\
R_{N(X)}\left[A(X) H(X) X^{s-h}\right] & \text{if } s > h
\end{cases}
\tag{60}
$$

To briefly illustrate the main idea, consider the special case $s = h$. Then $B(X) = R_{F(X)}\left[A(X) X^h\right]$ under
the basic technique, and $B(X) = R_{F(X)}\left[A(X)(X^{i_k} + X^{i_{k-1}} + \cdots + X^{i_1} + 1)\right]$ under the new technique.
Intuition suggests that computing $B(X)$ via the new technique is faster than the basic technique if $i_k$ is
sufficiently small. More precise conditions on $i_k$ are given in the following.

Let $e$ be the operation count per input byte required for computing the CRC check tuple under the new
technique (60). We have $e = e_f$ for the special case $F(X) = F_h(X) = X^h + X^2 + X + 1$. Although Fig. 9
shows that $e_f < e_b$, it may not be the case that $e < e_b$ for the more general polynomial $F(X)$. Thus, in the
following, we determine the conditions on $F(X)$ so that $e < e_b$ or $e_b/e > 1$ (i.e., the conditions under which
the new technique is faster than the basic technique). Thus, the new technique serves as a faster alternative
to the basic technique when $e_b/e > 1$. Before continuing, we present the following remarks, which contain
some results that will be used later to determine the operation count required for computing $B(X)$.

Remark 11. Let $r'$ be the number of operations required for computing

$$
\begin{aligned}
B'(X) &= A(X)(X^{j_n} + \cdots + X^{j_1} + 1) \\
      &= A(X) X^{j_n} + \cdots + A(X) X^{j_1} + A(X)
\end{aligned}
$$

where it is assumed that the tuple $A(X) X^{j_i}$ can be stored in a single computer word. Computing $A(X) X^{j_i}$
is then equivalent to shifting $A(X)$ to the left by $j_i$ bits, which can be done by a single operation on most
computers. Thus, for a given $A(X)$, we can compute $B'(X)$ by using $n$ left-shift operations and $n$ addition
operations. We then have $r' = 2n$. □
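The count $r' = 2n$ in Remark 11 corresponds to one shift and one XOR per nonconstant term. A minimal sketch in C, assuming every shifted copy of $A(X)$ fits in a 64-bit word (the function name and argument layout are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* B'(X) = A(X)(X^{j_n} + ... + X^{j_1} + 1), where exps[] holds j_1, ..., j_n.
   Assumption: each A(X) X^{j_i} fits in 64 bits, so no reduction is needed. */
uint64_t sparse_poly_multiply(uint64_t a, const unsigned *exps, size_t n)
{
    uint64_t b = a;                /* contribution of the constant term 1 */

    for (size_t i = 0; i < n; i++)
        b ^= a << exps[i];         /* one left shift and one addition (XOR) */
    return b;
}
```

For example, with $A(X) = X^2 + 1$ and exponents $\{1, 2\}$, the result is $(X^2 + 1)(X^2 + X + 1) = X^4 + X^3 + X + 1$.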

Remark 12. Let $r^*$ be the number of operations required for computing

$$
\begin{aligned}
B^*(X) &= R_{M^*(X)}\left[A^*(X)(X^{j_n} + \cdots + X^{j_q} + 1)\right] \\
       &= R_{M^*(X)}\left[A^*(X) X^{j_n}\right] + \cdots + R_{M^*(X)}\left[A^*(X) X^{j_q}\right] + R_{M^*(X)}\left[A^*(X)\right]
\end{aligned}
$$

where $n \geq q$, and $R_{M^*(X)}\left[A^*(X) X^{j_i}\right] \neq A^*(X) X^{j_i}$, i.e., the polynomial division is needed.

Define

$$
\begin{aligned}
C_{q-1}(X) &= R_{M^*(X)}\left[A^*(X)\right] \\
C_q(X) &= R_{M^*(X)}\left[C_{q-1}(X) X^{j_q}\right] \\
C_{q+1}(X) &= R_{M^*(X)}\left[C_q(X) X^{j_{q+1} - j_q}\right] \\
&\;\;\vdots \\
C_{m+q}(X) &= R_{M^*(X)}\left[C_{m+q-1}(X) X^{j_{m+q} - j_{m+q-1}}\right] \\
&\;\;\vdots \\
C_n(X) &= R_{M^*(X)}\left[C_{n-1}(X) X^{j_n - j_{n-1}}\right]
\end{aligned}
$$

Let $r_0$ be the operation count required for computing $C_{q-1}(X)$. Given $C_{q-1}(X)$, the term $C_q(X) =
R_{M^*(X)}\left[C_{q-1}(X) X^{j_q}\right]$ can be computed with $5.5 j_q$ operations, and so on. Given $C_{n-1}(X)$, the final term
$C_n(X) = R_{M^*(X)}\left[C_{n-1}(X) X^{j_n - j_{n-1}}\right]$ can be computed in $5.5(j_n - j_{n-1})$ operations. Thus, computing
$C_{q-1}(X), C_q(X), \ldots, C_n(X)$ altogether requires

$$
r_0 + 5.5 j_q + 5.5(j_{q+1} - j_q) + \cdots + 5.5(j_n - j_{n-1}) = r_0 + 5.5 j_n
$$

operations. Because $B^*(X) = C_{q-1}(X) + C_q(X) + \cdots + C_n(X)$, the tuple $B^*(X)$ is computed by using
$(n - q + 1)$ addition operations. Overall, $B^*(X)$ can be computed with

$$
r^* = 5.5 j_n + n - q + 1 + r_0
$$

operations, where $r_0$ is the operation count required for computing $R_{M^*(X)}\left[A^*(X)\right]$. □

C.1.1 Case: $s \geq h$

In this case, according to Fig. 10, the new technique uses Algorithm 4 (shown in Fig. 4), which contains the
computation of $B(X)$. From (57) and (60), we have

$$
\begin{aligned}
B(X) &= R_{N(X)}\left[A(X) H(X) X^{s-h}\right] \\
     &= R_{N(X)}\left[A^*(X) X^{i_k}\right] + R_{N(X)}\left[A^*(X) X^{i_{k-1}}\right] + \cdots + R_{N(X)}\left[A^*(X) X^{i_1}\right] + R_{N(X)}\left[A^*(X)\right]
\end{aligned}
$$

where $A^*(X) = A(X) X^{s-h}$. Using Remark 3, it can be shown that $R_{N(X)}\left[A^*(X)\right] = R_{N(X)}\left[A(X) X^{s-h}\right]$ can
be computed with $r_0 = 5.5(s - h)$ operations (see Appendix C). Applying Remark 12 with $M^*(X) = N(X)$,
$n = k$, $j_n = i_k$, $q = 1$, and $r_0 = 5.5(s - h)$, the tuple $B(X)$ can be computed with $r = 5.5(s - h + i_k) + k$
operations.

Fig. 13 shows that $x = 3$ for $s \geq h$ under Algorithm 4. By substituting the values of $x$ and $r$ into (33),
the operation count per input byte required for computing the check tuple under the new technique is

$$
e = \frac{8\left[3 + 5.5(s - h + i_k) + k\right]}{s}
\tag{61}
$$

Using (58) and (61), we have

$$
\frac{e_b}{e} = \frac{3 + 5.5s}{3 + 5.5(s - h + i_k) + k}
\tag{62}
$$

Thus,  the  new  technique  is  faster  than  the  basic  technique  when  e&/e  >  1,  i.e.,  3+5. 5s  >  3+5.5 (s  —  h+ik)  +  k, 
which  is  equivalent  to 

,  k 

<  h  -  — 

5.5 


where  ik  =  degree(17(A))  and  k  =  weight  of  ( H{X )  +  1). 
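As a quick numerical check of (62) and of the condition on i_k, the following Python sketch evaluates both with exact rational arithmetic. The function names are illustrative assumptions; only the formulas come from the text.

```python
from fractions import Fraction


def speedup_s_ge_h(s, h, i_k, k):
    """e_b / e from (62), valid for s >= h."""
    num = 3 + Fraction(11, 2) * s
    den = 3 + Fraction(11, 2) * (s - h + i_k) + k
    return num / den


def is_faster(h, i_k, k):
    """Condition i_k < h - k/5.5, equivalent to e_b / e > 1."""
    return i_k < h - Fraction(k) / Fraction(11, 2)


# CRC-CCITT (h = 16, i_k = 12, k = 2) processed with s = 16
print(speedup_s_ge_h(16, 16, 12, 2))   # 91/71
print(is_faster(16, 12, 2))            # True
```

Note that the condition does not depend on s: once s ≥ h, only the degree and weight of H(X) decide whether the new technique wins.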

Remark 13. Suppose that $F(X)$ is either $X^h + X^2 + X + 1$ or $X^h + X + 1$. Then $i_k \le 2$, i.e., $i_k$ is a very small value. Thus, it is appropriate to use loop unrolling in the calculation of $C_m(X)$ above. Then (61) reduces to $e = 8[3 + 5.5(s - h) + 3.5 i_k + k]/s$, and then

$$\frac{e_b}{e} = \frac{3 + 5.5s}{3 + 5.5(s - h) + 3.5 i_k + k} \tag{63}$$

Note that it is common to choose $s, h \in \{8, 16, 32, 64\}$, i.e., the typical values of $s$ and $h$ are not very small, even when $i_k$ is very small.

For example, suppose now that $s = h$ and $F(X) = F_h(X) = X^h + X^2 + X + 1$, i.e., $k = i_k = 2$ and $e = e_f$. Substituting $s = h$ and $k = i_k = 2$ into (63), we have

$$\frac{e_b}{e} = \frac{e_b}{e_f} = \frac{3 + 5.5h}{3 + 3.5 \times 2 + 2} = 0.25 + 0.458h$$

as previously shown in (29). □


C.1.2  Case:  s  <  h 

In this case, according to Fig. 10, the new technique uses Algorithm 3 (shown in Fig. 3), which contains the computation of $B(X)$. From (60), we have $B(X) = R_{F(X)}[A(X)H(X)]$. From Fig. 13, we have

$$x = \begin{cases} 6 & \text{if } h = 8, 16, 32, 64 \\ 7 & \text{if } h \ne 8, 16, 32, 64 \end{cases}$$

By substituting the values of $x$ into (33), the operation count per input byte required for computing the CRC check tuple under the new technique is

$$e = \begin{cases} 8(6 + r)/s & \text{if } h = 8, 16, 32, 64 \\ 8(7 + r)/s & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{64}$$

where $r$ is the number of operations required for computing $B(X) = R_{F(X)}[A(X)H(X)]$. From (58), we have $e_b = 8(4 + 5.5s)/s$ for $s < h$, which is used with (64) to yield

$$\frac{e_b}{e} = \begin{cases} (4 + 5.5s)/(6 + r) & \text{if } h = 8, 16, 32, 64 \\ (4 + 5.5s)/(7 + r) & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{65}$$

where $r$, which depends on whether $i_k \le h - s$ or $i_k > h - s$, is computed in the following subsections. As seen below, the condition $i_k \le h - s$ implies that $B(X) = R_{F(X)}[A(X)H(X)] = A(X)H(X)$, i.e., the polynomial division is eliminated.
C.1.2.1  Case: s < h and i_k ≤ h − s

Because $\mathrm{degree}(A(X)) < s$ and $\mathrm{degree}(H(X)) = i_k$, we have $\mathrm{degree}(A(X)H(X)) < s + i_k$. The assumption $i_k \le h - s$ then implies that $\mathrm{degree}(A(X)H(X)) < h$. Thus, $B(X) = R_{F(X)}[A(X)H(X)] = A(X)H(X)$, i.e., the polynomial division is eliminated. Let $r$ be the number of operations required for computing $B(X) = A(X)H(X)$. Using (57), we have

$$B(X) = A(X)X^{i_k} + A(X)X^{i_{k-1}} + \cdots + A(X)X^{i_1} + A(X)$$

Applying Remark 11, we then have $r = 2k$, which is substituted into (64) and (65) to yield

$$e = \begin{cases} 8(6 + 2k)/s & \text{if } h = 8, 16, 32, 64 \\ 8(7 + 2k)/s & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{66}$$

$$\frac{e_b}{e} = \begin{cases} (4 + 5.5s)/(6 + 2k) & \text{if } h = 8, 16, 32, 64 \\ (4 + 5.5s)/(7 + 2k) & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{67}$$

Thus, the new technique is faster than the basic technique if $e_b/e > 1$, i.e.,

$$4 + 5.5s > \begin{cases} 6 + 2k & \text{if } h = 8, 16, 32, 64 \\ 7 + 2k & \text{if } h \ne 8, 16, 32, 64 \end{cases}$$

which is equivalent to

$$k < \begin{cases} 2.75s - 1 & \text{if } h = 8, 16, 32, 64 \\ 2.75s - 1.5 & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{68}$$

where $i_k = \mathrm{degree}(H(X))$ and $k$ = weight of $(H(X) + 1)$.

For example, consider the CRC-64-ISO, generated by

$$F(X) = X^{64} + X^4 + X^3 + X + 1$$

Here, we have $h = 64$, $k = 3$, and $i_k = 4$. Assume that $s \le h - i_k = 60$. Under the new technique, we then have

$$
\begin{aligned}
B(X) &= R_{F(X)}\left[A(X)(X^4 + X^3 + X + 1)\right] \\
&= A(X)X^4 + A(X)X^3 + A(X)X + A(X)
\end{aligned}
$$

i.e., the polynomial division is eliminated. Substituting $k = 3$ into (67), we have

$$\frac{e_b}{e} = \frac{4 + 5.5s}{12}$$

For the special case $s = 32$, we have $e_b/e = 15$, i.e., the new technique is 15 times faster than the basic technique for the CRC-64-ISO. We also have $e_b/e = 15$ when $s = 32$ for a 64-bit CRC generated by a polynomial that has the following more general form

$$F(X) = X^{64} + X^{i_3} + X^{i_2} + X^{i_1} + 1$$

where $32 \ge i_3 > i_2 > i_1 > 0$.
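The elimination of the division for the CRC-64-ISO can be illustrated with a short Python sketch. It assumes polynomials are stored as integers (bit i is the coefficient of X^i); `poly_mod` and `b_shift_xor` are hypothetical helper names introduced only for this illustration.

```python
def poly_mod(a, m):
    """Remainder of a(X) divided by m(X) over GF(2)."""
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a


F = (1 << 64) | 0b11011      # F(X) = X^64 + X^4 + X^3 + X + 1


def b_shift_xor(a):
    """A(X)X^4 + A(X)X^3 + A(X)X + A(X): three shifts and three XORs."""
    return (a << 4) ^ (a << 3) ^ (a << 1) ^ a


a = 0xDEADBEEF               # a 32-bit input tuple, so s = 32 <= 60
# The product has degree below 64, so reducing it modulo F changes nothing.
assert poly_mod(b_shift_xor(a), F) == b_shift_xor(a)

# Speedup from (67) with k = 3 and s = 32
print((4 + 5.5 * 32) / (6 + 2 * 3))   # 15.0
```

With a tuple longer than 60 bits the product can reach degree 64, and the reduction modulo F(X) is needed again.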

C.1.2.2  Case: s < h and i_k > h − s

As seen below, the computation of $B(X)$ requires the polynomial division in this case. The assumption $i_k > h - s$ implies that there exists $m^*$ such that $1 \le m^* \le k$, $i_{m^*} > h - s$, and $i_j \le h - s$ for all $j < m^*$. There are 3 subcases to consider.

Case 1: $1 < m^* < k$. By letting $m = m^* - 1$, we have

$$F(X) = X^h + X^{i_k} + \cdots + X^{i_{m+1}} + X^{i_m} + \cdots + X^{i_1} + 1$$

where $h > i_k > \cdots > i_{m+1} > i_m > \cdots > i_1 > 0$, $i_{m+1} > h - s$, and $i_m \le h - s$.

Because $i_m \le h - s$, we have $\mathrm{degree}(A(X)X^{i_n}) < h$ for $1 \le n \le m$. Thus,

$$R_{F(X)}\left[A(X)X^{i_n}\right] = A(X)X^{i_n}$$

for $1 \le n \le m$. From (60), we then have

$$B(X) = R_{F(X)}\left[A(X)X^{i_k}\right] + \cdots + R_{F(X)}\left[A(X)X^{i_{m+1}}\right] + A(X)X^{i_m} + \cdots + A(X)X^{i_1} + A(X)$$

By letting $A^*(X) = A(X)X^{i_{m+1}}$, we can write

$$B(X) = B_1(X) + B_2(X)$$

where

$$B_1(X) = R_{F(X)}\left[A^*(X)X^{i_k - i_{m+1}}\right] + \cdots + R_{F(X)}\left[A^*(X)X^{i_{m+2} - i_{m+1}}\right] + R_{F(X)}\left[A^*(X)\right]$$

and

$$B_2(X) = A(X)X^{i_m} + \cdots + A(X)X^{i_1} + A(X)$$

Because $R_{F(X)}[A^*(X)] = R_{F(X)}[A(X)X^{i_{m+1}}] = R_{F(X)}[(A(X)X^{h-s})X^{i_{m+1} - (h-s)}]$, the term $R_{F(X)}[A^*(X)]$ can be computed with $r_0 = 1 + 5.5[i_{m+1} - (h - s)]$ operations for a given $A(X)$. Using Remark 12, $B_1(X)$ can be computed with

$$
\begin{aligned}
r_1 &= 5.5(i_k - i_{m+1}) + k - (m + 2) + 1 + r_0 \\
&= 5.5[i_k - (h - s)] + k - m
\end{aligned}
$$

operations. Using Remark 11, $B_2(X)$ can be computed with $r_2 = 2m$ operations. Overall, the number of operations required for computing $B(X)$ is

$$
\begin{aligned}
r &= r_1 + r_2 + 1 \\
&= 5.5[i_k - (h - s)] + k + m + 1
\end{aligned}
$$

which is substituted into (65) to yield

$$\frac{e_b}{e} = \begin{cases} (4 + 5.5s)/(6 + 5.5[i_k - (h - s)] + k + m + 1) & \text{if } h = 8, 16, 32, 64 \\ (4 + 5.5s)/(7 + 5.5[i_k - (h - s)] + k + m + 1) & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{69}$$

Thus, the new technique is faster than the basic technique when

$$4 + 5.5s > \begin{cases} 6 + 5.5[i_k - (h - s)] + k + m + 1 & \text{if } h = 8, 16, 32, 64 \\ 7 + 5.5[i_k - (h - s)] + k + m + 1 & \text{if } h \ne 8, 16, 32, 64 \end{cases}$$

which is equivalent to

$$i_k < \begin{cases} h - (3 + k + m)/5.5 & \text{if } h = 8, 16, 32, 64 \\ h - (4 + k + m)/5.5 & \text{if } h \ne 8, 16, 32, 64 \end{cases}$$

where $i_k = \mathrm{degree}(H(X))$ and $k$ = weight of $(H(X) + 1)$. Recall that we also assume that $h > i_k > i_{m+1} > i_m > i_1 > 0$, $i_{m+1} > h - s$, and $i_m \le h - s$.

For example, consider the CRC-32-IEEE 802.3 generated by

$$F(X) = X^{32} + X^{26} + X^{23} + X^{22} + X^{16} + X^{12} + X^{11} + X^{10} + X^8 + X^7 + X^5 + X^4 + X^2 + X + 1 \tag{70}$$

i.e., $h = 32$, $k = 13$, $i_k = 26$. Assume that $s = 16$. We then have $m = 10$. Substituting these values into (69) yields $e_b/e = 92/85$, i.e., the new technique is slightly faster than the basic technique.

Case 2: $m^* = 1$. We then have

$$F(X) = X^h + X^{i_k} + \cdots + X^{i_1} + 1$$

where $i_1 > h - s$. We have

$$
\begin{aligned}
B(X) &= R_{F(X)}\left[A(X)X^{i_k}\right] + \cdots + R_{F(X)}\left[A(X)X^{i_2}\right] + R_{F(X)}\left[A(X)X^{i_1}\right] + A(X) \\
&= R_{F(X)}\left[A^*(X)X^{i_k - i_1}\right] + \cdots + R_{F(X)}\left[A^*(X)X^{i_2 - i_1}\right] + R_{F(X)}\left[A^*(X)\right] + A(X) \\
&= B_1(X) + A(X)
\end{aligned}
$$

where $A^*(X) = A(X)X^{i_1}$ and

$$B_1(X) = R_{F(X)}\left[A^*(X)X^{i_k - i_1}\right] + \cdots + R_{F(X)}\left[A^*(X)X^{i_2 - i_1}\right] + R_{F(X)}\left[A^*(X)\right]$$

Because $R_{F(X)}[A^*(X)] = R_{F(X)}[A(X)X^{i_1}] = R_{F(X)}[(A(X)X^{h-s})X^{i_1 - (h-s)}]$, the term $R_{F(X)}[A^*(X)]$ can be computed with $r_0 = 1 + 5.5[i_1 - (h - s)]$ operations for a given $A(X)$. Using Remark 12, $B_1(X)$ can be computed with

$$
\begin{aligned}
r_1 &= 5.5(i_k - i_1) + k - 2 + 1 + r_0 \\
&= 5.5[i_k - (h - s)] + k
\end{aligned}
$$

operations. Thus, the number of operations required for computing $B(X)$ is

$$
\begin{aligned}
r &= r_1 + 1 \\
&= 5.5[i_k - (h - s)] + k + 1
\end{aligned}
$$

which is substituted into (65) to yield

$$\frac{e_b}{e} = \begin{cases} (4 + 5.5s)/(6 + 5.5[i_k - (h - s)] + k + 1) & \text{if } h = 8, 16, 32, 64 \\ (4 + 5.5s)/(7 + 5.5[i_k - (h - s)] + k + 1) & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{71}$$

Case 3: $m^* = k$. We then have

$$F(X) = X^h + X^{i_k} + \cdots + X^{i_1} + 1$$

where $i_k > h - s$, and $i_n \le h - s$ for all $n < k$. We have

$$
\begin{aligned}
B(X) &= R_{F(X)}\left[A(X)X^{i_k}\right] + A(X)X^{i_{k-1}} + \cdots + A(X)X^{i_1} + A(X) \\
&= R_{F(X)}\left[A(X)X^{i_k}\right] + B_2(X)
\end{aligned}
$$

where

$$B_2(X) = A(X)X^{i_{k-1}} + \cdots + A(X)X^{i_1} + A(X)$$

Because $R_{F(X)}[A(X)X^{i_k}] = R_{F(X)}[(A(X)X^{h-s})X^{i_k - (h-s)}]$, the term $R_{F(X)}[A(X)X^{i_k}]$ can be computed with $r_0 = 1 + 5.5[i_k - (h - s)]$ operations for a given $A(X)$. Using Remark 11, $B_2(X)$ can be computed with $r_2 = 2(k - 1)$ operations. Thus, the number of operations required for computing $B(X)$ is

$$
\begin{aligned}
r &= r_0 + r_2 + 1 \\
&= 5.5[i_k - (h - s)] + 2k
\end{aligned}
$$

which is substituted into (65) to yield

$$\frac{e_b}{e} = \begin{cases} (4 + 5.5s)/(6 + 5.5[i_k - (h - s)] + 2k) & \text{if } h = 8, 16, 32, 64 \\ (4 + 5.5s)/(7 + 5.5[i_k - (h - s)] + 2k) & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{72}$$

For example, consider the CRC-32-IEEE 802.3 generated by (70), i.e., $h = 32$, $k = 13$, $i_k = 26$. Assume that $s = 8$. Substituting these values into (72) yields $e_b/e = 48/43$, i.e., the new technique is slightly faster than the basic technique.
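The operation counts for s < h can be collected into one small Python sketch. It is an illustration only: the function names are assumptions, and the sketch relies on the observation that the Case 1 count 5.5[i_k − (h − s)] + k + m + 1 reproduces the Case 2 count when m = 0 and the Case 3 count when m = k − 1, where m is the number of exponents not exceeding h − s.

```python
from fractions import Fraction


def ops_r(h, s, exps):
    """Operation count r for B(X) when s < h.

    exps lists the exponents i_1, ..., i_k of H(X) (constant term excluded).
    """
    k = len(exps)
    i_k = max(exps)
    m = sum(1 for e in exps if e <= h - s)   # terms that need no division
    if m == k:                               # i_k <= h - s: division eliminated
        return Fraction(2 * k)
    return Fraction(11, 2) * (i_k - (h - s)) + k + m + 1


def speedup(h, s, exps):
    """e_b / e from (65), with x = 6 for h in {8, 16, 32, 64} and 7 otherwise."""
    x = 6 if h in (8, 16, 32, 64) else 7
    return (4 + Fraction(11, 2) * s) / (x + ops_r(h, s, exps))


CRC32 = [1, 2, 4, 5, 7, 8, 10, 11, 12, 16, 22, 23, 26]
print(speedup(32, 16, CRC32))   # 92/85
print(speedup(32, 8, CRC32))    # 48/43
```

Exact fractions are used so that the results can be compared directly with the ratios quoted in the examples.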
C.2  CRC Generator Polynomials of Weight 4

We now consider the special case $k = 2$, i.e., $F(X)$ is a polynomial of weight 4:

$$F(X) = X^h + X^{i_2} + X^{i_1} + 1$$

where $h > i_2 > i_1 > 0$. In particular, $F(X) = F_h(X) = X^h + X^2 + X + 1$ when $i_2 = 2$ and $i_1 = 1$. Fig. 26 lists some weight-4 polynomials $F(X) = X^h + X^{i_2} + X^{i_1} + 1$, which have periods that are greater than those of $F_h(X)$, for $h \le 32$. Recall from Theorem 2 that the maximum length of a CRC equals the period of its generator polynomial. In the following, we consider the application of the new technique to weight-4 generator polynomials for CRCs such as CRC-16 and CRC-CCITT. For brevity, we only present the results for $s < h$ (the case $s \ge h$ can be handled similarly). There are 3 cases to consider.

Case 1: $i_2 \le h - s$ (i.e., $s \le h - i_2$). Using the new technique, we have $B(X) = R_{F(X)}[A(X)(X^{i_2} + X^{i_1} + 1)] = A(X)(X^{i_2} + X^{i_1} + 1)$, i.e., the polynomial division is eliminated. Substituting $k = 2$ into (66), we have

$$e = \begin{cases} 80/s & \text{if } h = 8, 16, 32, 64 \\ 88/s & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{73}$$

By comparing (73) with (39), we have

$$e = e_f \tag{74}$$

for $s \le h - i_2$. Using $k = 2$ and $s \le h - i_2$ in (68), it can be shown that the new technique is faster than the basic technique when

$$2 \le s \le h - i_2 \tag{75}$$

For example, let $F(X) = X^{32} + X^4 + X + 1$, i.e., $h = 32$ and $i_2 = 4$. It follows from (75) that the new technique is faster than the basic technique when $2 \le s \le 28$. Under this condition, we have

$$B(X) = A(X)(X^4 + X + 1) \tag{76}$$

i.e., the polynomial division is eliminated. Fig. 26 shows that the 32-bit CRC generated by $F(X)$ has the maximum length of $2{,}147{,}483{,}647 = 2^{31} - 1$ bits (≈ 268,435,456 bytes). Recall from Fig. 5 that the original fast 32-bit CRC, generated by $F_{32}(X) = X^{32} + X^2 + X + 1$, has the maximum length of $2{,}097{,}151 \approx (2^{31} - 1)/1024$ bits (≈ 262,143 bytes). Thus, the maximum length of the CRC generated by $F(X)$ is substantially larger than that of the fast CRC generated by $F_{32}(X)$. However, (74) shows that these 2 CRCs have identical complexity when $s \le 28$.

Consider the 12-bit CRC generated by $F(X) = X^{12} + X^3 + X + 1$. Fig. 26 shows that this CRC has the maximum length of 2,046 bits, which is much larger than that of the fast CRC generated by $F_{12}(X) = X^{12} + X^2 + X + 1$, which has the maximum length of only 595 bits (see Fig. 5). However, (74) shows that these 2 CRCs have identical complexity when $s \le 9$.

Case 2: $i_2 > h - s$ and $i_1 \le h - s$. Using the new technique, we have $B(X) = R_{F(X)}[A(X)(X^{i_2} + X^{i_1} + 1)] = R_{F(X)}[A(X)X^{i_2}] + A(X)X^{i_1} + A(X)$. Substituting $k = 2$ into (72) yields

$$\frac{e_b}{e} = \begin{cases} (4 + 5.5s)/(10 + 5.5[i_2 - (h - s)]) & \text{if } h = 8, 16, 32, 64 \\ (4 + 5.5s)/(11 + 5.5[i_2 - (h - s)]) & \text{if } h \ne 8, 16, 32, 64 \end{cases}$$

For example, consider the CRC-CCITT generated by $F(X) = X^{16} + X^{12} + X^5 + 1$, i.e., $h = 16$, $i_2 = 12$, and $i_1 = 5$. Assume that $s = 8$. We then have $e_b/e = (4 + 5.5 \times 8)/(10 + 5.5 \times 4) = 48/32 = 1.5$. Thus, for the 16-bit CRC-CCITT, the new technique is 50% faster than the basic technique.

Next, consider the CRC-16 generated by $F(X) = X^{16} + X^{15} + X^2 + 1$, i.e., $h = 16$, $i_2 = 15$, and $i_1 = 2$. Assume also that $s = 8$. We then have $e_b/e = (4 + 5.5 \times 8)/(10 + 5.5 \times 7) = 48/48.5$. Thus, for the CRC-16, the new technique is slightly slower than the basic technique.

Case 3: $i_1 > h - s$. Using the new technique, we have $B(X) = R_{F(X)}[A(X)(X^{i_2} + X^{i_1} + 1)] = R_{F(X)}[A(X)X^{i_2}] + R_{F(X)}[A(X)X^{i_1}] + A(X)$. Substituting $k = 2$ into (71) yields

$$\frac{e_b}{e} = \begin{cases} (4 + 5.5s)/(9 + 5.5[i_2 - (h - s)]) & \text{if } h = 8, 16, 32, 64 \\ (4 + 5.5s)/(10 + 5.5[i_2 - (h - s)]) & \text{if } h \ne 8, 16, 32, 64 \end{cases}$$
X^h + X^{i_2} + X^{i_1} + 1     period         (2^{h-1} - 1)/period

X^5 + X^3 + X + 1               15             1
X^7 + X^3 + X^2 + 1             62             1.01613
X^7 + X^4 + X^2 + 1             63             1
X^9 + X^5 + X^3 + 1             255            1
X^10 + X^3 + X^2 + 1            511            1
X^11 + X^3 + X + 1              1023           1
X^12 + X^3 + X + 1              2046           1.00049
X^12 + X^7 + X^2 + 1            2047           1
X^15 + X^3 + X^2 + 1            16382          1.00006
X^15 + X^5 + X^3 + 1            16383          1
X^17 + X^3 + X + 1              63457          1.03275
X^17 + X^4 + X^3 + 1            65534          1.00002
X^17 + X^10 + X^4 + 1           65535          1
X^18 + X^5 + X^2 + 1            98301          1.33336
X^18 + X^5 + X^4 + 1            131071         1
X^19 + X^3 + X^2 + 1            262142         1
X^19 + X^5 + X^3 + 1            262143         1
X^20 + X^4 + X^3 + 1            521985         1.00441
X^20 + X^7 + X^5 + 1            524286         1
X^20 + X^11 + X^2 + 1           524287         1
X^21 + X^3 + X + 1              1048575        1
X^22 + X^3 + X + 1              491460         4.26719
X^22 + X^3 + X^2 + 1            2094081        1.00147
X^22 + X^7 + X^4 + 1            2097151        1
X^23 + X^3 + X + 1              4161409        1.0079
X^23 + X^6 + X + 1              4194300        1
X^23 + X^7 + X^6 + 1            4194302        1
X^23 + X^8 + X^2 + 1            4194303        1
X^25 + X^3 + X + 1              4194303        4
X^25 + X^4 + X + 1              7864260        2.13335
X^25 + X^4 + X^3 + 1            12070842       1.3899
X^25 + X^5 + X + 1              16766977       1.00061
X^25 + X^6 + X^3 + 1            16777212       1
X^25 + X^9 + X^2 + 1            16777214       1
X^25 + X^14 + X^2 + 1           16777215       1
X^26 + X^3 + X + 1              32505732       1.03226
X^26 + X^4 + X + 1              33554431       1
X^27 + X^3 + X^2 + 1            67108862       1
X^27 + X^5 + X + 1              67108863       1
X^28 + X^3 + X + 1              97517382       1.37635
X^28 + X^5 + X^2 + 1            134217727      1
X^29 + X^11 + X + 1             268435455      1
X^30 + X^3 + X + 1              536870908      1
X^30 + X^7 + X^6 + 1            536870911      1
X^31 + X^3 + X^2 + 1            50133510       21.4176
X^31 + X^4 + X + 1              1073213442     1.00049
X^31 + X^6 + X^2 + 1            1073602561     1.00013
X^31 + X^6 + X^3 + 1            1073741822     1
X^31 + X^12 + X^2 + 1           1073741823     1
X^32 + X^3 + X + 1              21691754       99
X^32 + X^3 + X^2 + 1            22362795       96.0293
X^32 + X^4 + X + 1              2147483647     1

Fig. 26  The period of X^h + X^{i_2} + X^{i_1} + 1
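Entries of Fig. 26 with small h can be spot-checked by brute force. The Python sketch below assumes the usual definition of the period of F(X) as the smallest e > 0 such that F(X) divides X^e + 1; the helper names `poly` and `period` are hypothetical. The loop takes as many steps as the period itself, so it is only practical for modest degrees.

```python
def poly(*exps):
    """Build a GF(2) polynomial (as an int) from its exponents."""
    return sum(1 << e for e in exps)


def period(f):
    """Smallest e > 0 with X^e = 1 modulo f(X); requires a nonzero constant term."""
    assert f & 1, "constant term must be 1"
    deg = f.bit_length() - 1
    x = 1
    e = 0
    while True:
        x <<= 1                  # multiply by X
        if (x >> deg) & 1:       # degree reached deg: subtract (XOR) f
            x ^= f
        e += 1
        if x == 1:
            return e


print(period(poly(5, 3, 1, 0)))    # 15
print(period(poly(7, 3, 2, 0)))    # 62
```

For the degree-32 entries the same loop would need on the order of 2^31 iterations, so a faster order-finding method would be preferable there.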


C.3  CRC  Generator  Polynomials  of  Weight  3 

We now consider the special case $k = 1$, i.e., $F(X)$ is a polynomial of weight 3: $F(X) = X^h + X^{i_1} + 1$. By defining $i = i_1$, we have

$$F(X) = X^h + X^i + 1 \tag{77}$$

where $h > i > 0$. Note that $F(X) = T_h(X) = X^h + X + 1$ for the special case $i = 1$. Fig. 27 lists some weight-3 polynomials along with their periods, for $h \le 31$. In Section B.2, the fast $h^*$-bit perfect codes are constructed from the CRCs generated by $T_h(X)$ for $h^* = 4, 8, 16, 64, 128$, where $h^* = h + 1$. In the following, we show that other fast perfect codes can also be constructed from CRCs generated by weight-3 polynomials.

Let $C$ be the $h$-bit CRC generated by $F(X)$ in (77). Recall from Theorem 2 that the maximum length of $C$ equals the period of $F(X)$. Assume that $s \le h - i$. Using the new technique, we have $B(X) = R_{F(X)}[A(X)(X^i + 1)] = A(X)(X^i + 1)$, i.e., the polynomial division is eliminated. Let $e$ be the operation count per input byte required for computing the check tuple $P(X)$ of the $h$-bit CRC $C$. Then $e$ is given by (66).

Let $C^*$ be the non-CRC code that is constructed by adding an overall parity bit to the $h$-bit CRC $C$, and $P^*(X)$ be the check tuple of $C^*$. Let $e^*$ be the operation count per input byte required for computing $P^*(X)$. Note that $P^*(X)$ has $h + 1$ bits, which can be computed by an algorithm that is similar to Fig. 25. Using (66) and Fig. 25, it can then be shown that

$$e^* = \begin{cases} 8(7 + 2k)/s & \text{if } h = 8, 16, 32, 64 \\ 8(8 + 2k)/s & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{78}$$

Substituting $k = 1$ into (78), we have

$$e^* = \begin{cases} 72/s & \text{if } h = 8, 16, 32, 64 \\ 80/s & \text{if } h \ne 8, 16, 32, 64 \end{cases} \tag{79}$$

Let us now compare the speed of the $(h+1)$-bit code $C^*$ with that of the fast $(h+1)$-bit CRC generated by $F_{h+1}(X) = X^{h+1} + X^2 + X + 1$. From (39), we have

$$e_f = \begin{cases} 80/s & \text{if } h + 1 = 8, 16, 32, 64 \\ 88/s & \text{if } h + 1 \ne 8, 16, 32, 64 \end{cases} \tag{80}$$

for $s \le h$. By comparing (79) with (80), we have

$$e^* \le e_f \tag{81}$$

for $s \le h - i$. Thus, (81) shows that the $(h+1)$-bit non-CRC code $C^*$ is at least as fast as the fast $(h+1)$-bit CRC generated by $F_{h+1}(X)$, i.e., $C^*$ is also a fast code for $s \le h - i$. Further, at its maximum length, the code $C^*$ is the $(2^h, 2^h - h - 1, 4)$ extended Hamming perfect code, provided that $F(X) = X^h + X^i + 1$ is a primitive polynomial (i.e., its period is $2^h - 1$).

For example, let $F(X) = X^{11} + X^2 + 1$, i.e., $h = 11$ and $i = 2$. Fig. 27 shows that $F(X)$ is primitive. Let $C$ be the 11-bit CRC generated by $F(X)$. The non-CRC code $C^*$, which is constructed by adding an overall parity bit to $C$, is the (2048, 2036, 4) extended Hamming perfect code. Note that both $C$ and $C^*$ are fast if we choose $s \le h - i = 9$. Suppose that we choose $s = 8$. From (79) and (80), we then have $e^* = 80/8 = 10$ and $e_f = 88/8 = 11$, i.e., $e^* < e_f$. Thus, for $s = 8$, the non-CRC 12-bit code $C^*$ is faster than the fast 12-bit CRC generated by $F_{12}(X) = X^{12} + X^2 + X + 1$. Further, the maximum length of the non-CRC code (which is 2,048 bits) is also much longer than that of the fast CRC generated by $F_{12}(X)$ (which is 595 bits), and 2 bits longer than that of the CRC generated by $F(X) = X^{12} + X^3 + X + 1$ (which is 2,046 bits, as discussed in Section C.2).

Similarly, we can construct a 32-bit extended Hamming perfect code $C^*$ by adding an overall parity bit to the CRC $C$ generated by $F(X) = X^{31} + X^3 + 1$ (see Fig. 27). We have $h = 31$ and $i = 3$. Both $C$ and $C^*$ are fast if we choose $s \le h - i = 28$.
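The comparison between (79) and (80) is simple arithmetic, and a short Python sketch makes the 11-bit example reproducible. The names `WORD_SIZES`, `e_star`, and `e_fast` are assumptions introduced for this illustration; the formulas are those of (78) and (80).

```python
from fractions import Fraction

WORD_SIZES = (8, 16, 32, 64)


def e_star(h, s, k=1):
    """Operations per input byte for the (h+1)-bit code C*, from (78)."""
    x = 7 if h in WORD_SIZES else 8
    return Fraction(8 * (x + 2 * k), s)


def e_fast(bits, s):
    """Operations per input byte for the fast CRC with `bits` check bits, from (80)."""
    return Fraction(80 if bits in WORD_SIZES else 88, s)


h, s = 11, 8                       # F(X) = X^11 + X^2 + 1, processed byte by byte
print(e_star(h, s))                # 10
print(e_fast(h + 1, s))            # 11
```

For h = 31 and s = 8 the same functions give 10 operations per byte for both codes, which is the equality case of (81).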
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Xh  +  TO  +  1 
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7 

i 

X4  +  X  +  1 

15 

i 
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21 
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31 

1 

X6  +  X  +  1 

63 

1 
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1 
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63 
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X9  +  X  +  1 

73 

7 
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X11  +  A'2  +  1 

2047 

1 
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11811 
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4.01618 

X23  +  X2  +  1 
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32.0469 
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15 
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X29  +  X2  +  1 
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1 
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99 
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1 

Fig. 27  The period of X^h + X^i + 1
APPENDIX D  CRC PARALLEL IMPLEMENTATION

Given a CRC, which is generated by a polynomial $M(X)$ of degree $h$, our goal is to compute the check $h$-tuple $P(X)$ to protect an input message $U(X) = (Q_0(X), \ldots, Q_{n-1}(X))$, where $Q_i(X)$ is an $s$-tuple.

So far, it is implicitly assumed that the CRC algorithms are for sequential implementation. That is, the entire input message $U(X)$ is supplied to a single processor of a computer, and the output $P(X)$ is then computed by this same processor. Following the technique in [6], we can modify these CRC algorithms for parallel implementation on $k$ different processors of a computer, $k > 1$, as follows.

First, the input message $U(X)$ is divided into $k$ sub-messages $E_0(X), \ldots, E_{k-1}(X)$, i.e.,

$$U(X) = (E_0(X), \ldots, E_{k-1}(X))$$

where $E_i(X)$ consists of $n_i$ $s$-tuples. Thus, $n = n_0 + \cdots + n_{k-1}$. Define

$$W_i(X) = R_{M(X)}\left[X^{(n_{i+1} + \cdots + n_{k-1})s}\right] \tag{82}$$

for $0 \le i \le k - 2$, and $W_{k-1}(X) = 1$. Note that $W_i(X)$ is computed from $X^{(n_{i+1} + \cdots + n_{k-1})s}$, which is used to determine the relative position of sub-message $E_i(X)$ in $U(X)$ (see Remark 14).

Next, for each $i = 0, 1, \ldots, k - 1$, input sub-message $E_i(X)$ is supplied to processor $i$, which is used to compute the following $h$-tuples:

$$P_i(X) = R_{M(X)}\left[E_i(X)X^h\right] \tag{83}$$

$$Z_i(X) = R_{M(X)}\left[P_i(X)W_i(X)\right] \tag{84}$$

where $W_i(X)$ is defined by (82). Note that $P_i(X)$ is the CRC check tuple computed by processor $i$ for sub-message $E_i(X)$. For each $i = 0, 1, \ldots, k - 1$, we assume that processor $i$ computes $P_i(X)$ and $Z_i(X)$ in (83) and (84), independent of other processors, i.e., the computation is done in parallel by the $k$ processors.
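Theorem 5. The tuples $Z_i(X)$, $0 \le i < k$, which are computed in parallel by the $k$ processors, are combined to yield the final CRC check $h$-tuple $P(X)$ for the entire input message $U(X)$, i.e.,

$$P(X) = \sum_{i=0}^{k-1} Z_i(X) \tag{85}$$

Proof. In polynomial notation, we have

$$U(X) = \sum_{i=0}^{k-2} E_i(X)X^{(n_{i+1} + \cdots + n_{k-1})s} + E_{k-1}(X)$$

The CRC check tuple $P(X)$ for $U(X)$ then becomes

$$
\begin{aligned}
P(X) &= R_{M(X)}\left[U(X)X^h\right] \\
&= \sum_{i=0}^{k-2} R_{M(X)}\left[E_i(X)X^h X^{(n_{i+1} + \cdots + n_{k-1})s}\right] + R_{M(X)}\left[E_{k-1}(X)X^h\right] \\
&= \sum_{i=0}^{k-2} R_{M(X)}\left[P_i(X)W_i(X)\right] + P_{k-1}(X) \\
&= \sum_{i=0}^{k-1} R_{M(X)}\left[P_i(X)W_i(X)\right] \\
&= \sum_{i=0}^{k-1} Z_i(X)
\end{aligned}
$$

□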
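The combination rule of Theorem 5 can be exercised with a small Python sketch. It assumes the plain remainder form of the CRC used in this appendix (no initial value, bit reflection, or final XOR), stores polynomials as integers, and treats the first s-tuple of a message as the highest-order coefficients. All function names are hypothetical, and the loop over sub-messages runs sequentially here; each iteration is independent and could be handed to a separate worker.

```python
def poly_mod(a, m):
    """Remainder of a(X) divided by m(X) over GF(2)."""
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a


def clmul(a, b):
    """Carry-less (GF(2)) product of a(X) and b(X)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def to_poly(tuples, s):
    """Concatenate s-tuples (first tuple = highest-order bits) into one polynomial."""
    u = 0
    for q in tuples:
        u = (u << s) | q
    return u


def crc(tuples, s, m):
    """P(X) = R_M[U(X) X^h] for the message given as a list of s-tuples."""
    h = m.bit_length() - 1
    return poly_mod(to_poly(tuples, s) << h, m)


def parallel_crc(subs, s, m):
    """Combine per-sub-message check tuples as in (82)-(85)."""
    total = 0
    for i, e in enumerate(subs):
        tail = sum(len(x) for x in subs[i + 1:])   # n_{i+1} + ... + n_{k-1}
        w = poly_mod(1 << (tail * s), m)           # W_i(X), eq. (82)
        p = crc(e, s, m)                           # P_i(X), eq. (83)
        total ^= poly_mod(clmul(p, w), m)          # Z_i(X), eq. (84); XOR is the sum (85)
    return total


M = 0x11021                                        # X^16 + X^12 + X^5 + 1
msg = list(b"123456789")                           # nine 8-bit tuples
subs = [msg[:3], msg[3:7], msg[7:]]                # k = 3 sub-messages
assert parallel_crc(subs, 8, M) == crc(msg, 8, M)
```

The sub-messages need not have equal length; the weights W_i(X) account for the position of each one within U(X).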
We now determine the total CRC computation time, denoted by $t_{\mathrm{total}}$, for the parallel technique. First, let $t_{W_i}$, $t_{P_i}$, and $t_{Z_i}$ be the times for processor $i$ to compute $W_i(X)$, $P_i(X)$, and $Z_i(X)$, respectively. Let $t_\Sigma$ be the time for the computer to compute the summation (85). We can consider $t_{W_i}$, $t_{Z_i}$, and $t_\Sigma$ as the overhead costs for the CRC parallel implementation. Because the $k$ processors compute (83) and (84) in parallel, the total time for the computer to compute the final CRC check tuple $P(X)$ is

$$t_{\mathrm{total}} = t_{W_i} + \max\{t_{P_i} + t_{Z_i},\ 0 \le i < k\} + t_\Sigma \tag{86}$$

We now determine the speedup factor for the parallel technique under the following ideal conditions: (a) the $k$ processors have identical computational capability, (b) the sub-messages $E_i(X)$ have the same length, i.e., $n_i = n/k$, and (c) the overhead costs $t_{W_i}$, $t_{Z_i}$, and $t_\Sigma$ are negligible compared to $t_{P_i}$, i.e., $t_{W_i} + t_{P_i} + t_{Z_i} + t_\Sigma \approx t_{P_i}$ (see Remark 14). From (86), we then have $t_{\mathrm{total}} \approx t_{P_i} \approx t_P/k$, where $t_P$ denotes the time for a single processor to compute the CRC check tuple $P(X)$ for the entire message $U(X)$, i.e., $t_P$ is the CRC computational time for sequential implementation. Thus, under the ideal conditions, the speedup factor is approximately $k$ for parallel implementation.

Remark 14. Under the CRC parallel implementation, processor $i$ computes $W_i(X)$, $P_i(X)$, and $Z_i(X)$ as given in (82)-(84), $i = 1, \ldots, k - 1$. These tuples can be computed as follows. First, it can be shown from (82) that

$$W_i(X) = R_{M(X)}\left[X^{n_{i+1}s}\,W_{i+1}(X)\right]$$

with $W_{k-1} = 1$. Thus, once $W_{i+1}(X)$ is known, $W_i(X)$ can be computed in $O(n_{i+1}s)$ steps (by Remark 1). We can also write $W_i(X) = R_{M(X)}[X^{n_{i+1}s - h}\,W_{i+1}(X)X^h]$, i.e., we can view $W_i(X)$ as the output check tuple of the CRC generated by $M(X)$ when $X^{n_{i+1}s - h}\,W_{i+1}(X)$ is the input tuple. Thus, $W_i(X)$ can be computed by either the CRC basic technique or the CRC new technique. Suppose now that $n_0, \ldots, n_{k-1}$ are known and fixed. The tuples $W_0(X), W_1(X), \ldots, W_{k-1}(X)$ can then be stored in a table defined by $T[i] = W_i(X)$, $i = 0, 1, \ldots, k - 1$ (cf. [6]). Next, processor $i$ can use either the basic technique or the new technique to compute the (partial) CRC check tuple $P_i(X)$. Further, using the technique “Mimic long multiplication as done by hand” in [11, p. 90], it can be shown that the tuple $Z_i(X) = R_{M(X)}[P_i(X)W_i(X)]$ can be computed in $O(h)$ steps. Finally, once $Z_0(X), \ldots, Z_{k-1}(X)$ are computed by the $k$ processors, their summation in (85) can be quickly computed. Thus, for a sufficiently long sub-message $E_i(X)$ along with the use of table lookup for determining $W_i(X)$, the computational complexity of $P_i(X)$ is much greater than that of $W_i(X)$, $Z_i(X)$, and the summation (85), i.e., $t_{P_i} \gg t_{W_i}, t_{Z_i}, t_\Sigma$. □
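The table of weights described in Remark 14 can be built with the backward recurrence, as in the following Python sketch. It is a minimal illustration assuming fixed sub-message lengths n_0, ..., n_{k-1} and integer-encoded polynomials; `poly_mod` and `w_table` are hypothetical names.

```python
def poly_mod(a, m):
    """Remainder of a(X) divided by m(X) over GF(2)."""
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a


def w_table(lengths, s, m):
    """T[i] = W_i(X), built from W_{k-1} = 1 and W_i = R_M[X^(n_{i+1} s) W_{i+1}]."""
    k = len(lengths)
    table = [0] * k
    table[k - 1] = 1
    for i in range(k - 2, -1, -1):
        table[i] = poly_mod(table[i + 1] << (lengths[i + 1] * s), m)
    return table


lengths = [3, 4, 2]                 # n_0, n_1, n_2 (numbers of s-tuples)
T = w_table(lengths, 8, 0x11021)

# Each entry agrees with the direct definition (82).
for i in range(len(lengths)):
    tail = sum(lengths[i + 1:])
    assert T[i] == poly_mod(1 << (tail * 8), 0x11021)
```

Because the table depends only on the sub-message lengths and on M(X), it can be computed once and reused for every message with the same partition.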